Integrating document BLOB storage with SQL Server
NoSQL platforms can support highly scalable databases with BLOB attachments (images, documents, and other files), but if you think you need to embrace a NoSQL solution in lieu of SQL Server just because you have a high volume of BLOBs in your database, then think again. Sure, if you have good reasons to go with NoSQL anyway – for example, if you want the flexibility of schema-free tables, and you can accept the compromises of eventual transactional consistency – then NoSQL can fit the bill nicely.
But critical line-of-business applications often can’t afford the relaxed constraints of NoSQL databases, and usually require schemas that are strongly typed, with full transactional integrity; that is, a full-fledged relational database system (RDBMS). However, relational database platforms like SQL Server were originally designed and optimized to work primarily with structured data, not BLOBs. And so historically, it’s never been feasible to store large amounts of BLOB data directly in the database. That is, until FILESTREAM.
With FILESTREAM, Microsoft addresses the dilemma of storing BLOBs within the relational database. My new Pluralsight course, SQL Server 2012-2014 Native File Streaming, explains this innovative feature in detail, and in this blog post, I’ll discuss how FILESTREAM (and its related technologies) can be used to implement a highly-scalable BLOB storage solution that’s fully integrated with a relational SQL Server database. You’ll also find live demos on everything mentioned in this post in the course.
Although SQL Server was never originally intended to handle BLOBs in large scale, this is no longer true as of FILESTREAM (introduced in SQL Server 2008). Before FILESTREAM, SQL Server was forced to shove BLOBs into the standard database filegroups, which are really optimized for storing structured row data in 8k pages. Because BLOBs don’t fit naturally within this structure, they must be pushed into off-row storage, which bloats the structured filegroups, and ultimately kills performance.
FILESTREAM changes all that. First, to be clear, FILESTREAM is not actually a data type. Rather, it’s an attribute that you apply to the varbinary(max) data type, the same data type that you would use to store BLOBs directly inside the row. But by merely appending the FILESTREAM attribute to the varbinary(max) data type, SQL Server takes a radically different approach to physical BLOB storage. Rather than inundating the standard database filegroups with BLOBs, SQL Server stores BLOB content as files in the file system – where they belong; the file system being a native environment optimized for storing and streaming unstructured binary content. At the same time, it establishes and maintains reference pointers between the rows in the standard filegroups and the files in the file system that are tied to varbinary(max) columns in those rows. All this magic occurs behind the scenes, and is totally transparent to any existing code that works with ordinary varbinary(max) columns.
In this manner, the BLOB data is physically stored separately from structured row data, but it is logically an integral part of the database. So for example, backing up the database includes the BLOB data, with the option of performing a partial backup that excludes the FILESTREAM filegroup when you want to create smaller backups that don’t include BLOB data.
Furthermore, this solution provides full transactional consistency – because FILESTREAM integrates with the NTFS file system, and NTFS is a transactional file system. So when you start a database transaction and insert a row, and that row includes BLOB data stored in a varbinary(max) FILESTREAM column, then SQL Server automatically initiates an NTFS file system transaction over that BLOB data. Then, the fate of the file system transaction hinges on the fate of the database transaction. If and when the database transaction commits, then SQL Server will also commit the NTFS file system transaction; similarly, rolling back the database transaction automatically rolls back the NTFS transaction.
Accessing BLOBs with T-SQL
With FILESTREAM, you can treat BLOBs as ordinary varbinary(max) columns in T-SQL. For example, you can use the OPENROWSET function with the BULK provider to import an external file into a varbinary(max) column, and if that column is decorated with the FILESTREAM attribute, then SQL Server will automatically store a copy of that file as a BLOB in the NTFS file system behind the scenes, rather than force-fitting it into the standard database filegroups.
INSERT INTO PhotoAlbum(PhotoId, PhotoDescription, Photo) VALUES( 3, 'Mountains', (SELECT BulkColumn FROM OPENROWSET(BULK 'C:\Demo\Ascent.jpg', SINGLE_BLOB) AS x))
To retrieve BLOBs, it’s a simple SELECT:
Using SqlFileStream and the Streaming API
Although FILESTREAM delivers scalable storage by leveraging the NTFS file system behind the scenes, BLOB access needs to scale as well. It’s great that you can enjoy total transparency by just using T-SQL access, but stop for a moment and think about what SQL Server needs to do when retrieving BLOBs with T-SQL. In order to serve up the Photo column in the resultset shown above for the SELECT statement, for example, SQL Server needed to read the entire contents of each BLOB from the NTFS file system that it’s managing internally, and this can easily and suddenly place a great deal of memory pressure on the server.
To address this concern, FILESTREAM exposes the streaming API. When you use this API, SQL Server still manages the file system behind the scenes, only it shifts the burden and memory requirements of actually reading and writing BLOBs in the file system off of itself and onto the client application. This keeps the memory requirements on SQL Server very lean, regardless of how large your BLOBs may be.
The SqlFileStream class is a managed code wrapper around the streaming API, which makes it extremely easy to use from .NET. In C# or VB .NET, you start a database transaction and issue an INSERT statement, but you don’t actually include the BLOB content with the INSERT statement. Instead, SQL Server passes you back the information you need to create a SqlFileStream object. This object inherits from the base System.IO.Stream class, meaning that it supports all the standard read/write methods of standard .NET stream classes, including memory streams, HTTP request/response streams, and local file streams. So it’s easy to then stream your BLOBs in and out, using buffers in memory allocated to your application – not SQL Server. Then, you just commit the database transaction, and SQL Server automatically commits the NTFS file system transaction at the same time.
In my course, I show you SqlFileStream up close, and demonstrate how to program against the streaming API from a thick client application, a client/server (Web) application, and in an n-tier (WCF) scenario as well.
The FILESTREAM story only gets better with FileTable, added in SQL Server 2012. While FILESTREAM revolutionizes BLOB storage in SQL Server, it’s only accessible to developers and administrators. What about ordinary users? They’re certainly not going to write T-SQL or streaming API code to access BLOBs. And there’s also no way for ordinary client applications to access FILESTREAM data.
The solution is FileTable, which combines FILESTREAM with the hierarchyid data type to furnish an “emulated” file system; that is, a file system that users and applications can work with, but which is really a FileTable in a SQL Server database. A FileTable is just a regular table except that it has a fixed schema; specifically, it has these pre-determined columns for the metadata of the emulated file system:
Every row in a FileTable represents either a file or a folder (depending on the is_directory column), and the hierarchyid value in the path_locator column is what implements the folder structure of the emulated file system. The hierarchyid data type has methods that you can use to query and manipulate the structure; for example, you can programmatically move entire subtrees from one parent to another.
For rows that represent files, the file_stream column holds the actual BLOB, and this is a varbinary(max) FILESTREAM column. So behind the scenes, it is stored in the NTFS file system just like a varbinary(max) FILESTREAM column of an ordinary table (a non-FileTable) would be.
And so, in addition to being able to use T-SQL or the streaming API with a FileTable, the emulated file system that a FileTable represents also gets exposed to users and client applications via a Windows file share. As a result, changes made to the table in the database are reflected in the emulated file system, and conversely, changes made to the emulated file system by users or client applications are reflected automatically in the database, which ultimately pushes down into the physical NTFS file system being used for BLOB storage behind the scenes.
This blog post explained FILESTREAM, and its related feature, FileTable. We first saw how FILESTREAM supports scalable BLOB storage using the NTFS file system behind the scenes, and provides transparent T-SQL access using the varbinary(max) data type. We also learned about the streaming API and SqlFileStream, which shifts the burden and memory requirements for streaming BLOBs off of SQL Server and onto client applications, providing scalable BLOB access. And we finally saw how FileTable combines FILESTREAM with the hierarchyid data type to furnish an emulated file system on the front end that users and client applications can interact with, but which in actuality is just a table in the database.