October 27, 2019

Secret SQL Weapon: Indexed (or Materialized) Views

A few months backs I implemented logic which needs to lookup entries in a big existing table, but the lookup wasn’t easy indexable. Here’s a guide on one approach you can take. I’m using SQL Server for this example. Let’s assume we have a table with guests, something like this

CREATE TABLE Guests (
   Id BIGINT NOT NULL IDENTITY(1,1) PRIMARY KEY,
   Created datetime NOT NULL DEFAULT getutcdate(),
   Email VARCHAR(256) NOT NULL,
   Name VARCHAR(256) NOT NULL
   -- More fields
);

-- Example data
SELECT TOP (3) * FROM Guests
-- Result:
Id Created    Email  Name
1  2019-10-26 12:58:10.867    corrinne1981@rocketmail.com    Corrinne Bordons
2  2019-10-26 12:58:10.877    admin@schriever.us         Cythia Schriever
3  2019-10-26 12:58:11.177    darcie@tutanota.com            Darcie Potencio

When we insert a new guest, we want to check how many guests are using the same email domain. OK, we can do this simple query:

-- A new guest with roman.stoffel@gamlor.info shows up, let's check how many @gamlor.info guests already exist.
SELECT COUNT(*) FROM guests
WHERE Email LIKE '%@gamlor.info'

-- SQL Server Execution Times:
--   CPU time = 2484 ms,  elapsed time = 2838 ms.

The big issue: the query runs very slow =(. Let’s inspect the query plan:

Plain Scan Query Plan
Figure 1. Plain Scan Query Plan

Ah, it scans over the table. Let’s try to add an index to the Email column and see the result:

-- Add an index to the email
CREATE INDEX Guests_Email_Index ON Guests(Email);
-- The query got a bit faster, depending on the size of the overall table
SELECT COUNT(Email) FROM Guests
WHERE Email LIKE '%@gamlor.info'
 SQL Server Execution Times:
   CPU time = 1921 ms,  elapsed time = 1920 ms.
Index Scan Query Plan
Figure 2. With Index Scan Query Plan

The issue persists, the query still needs to scan. It scans the specifi Email index, but that is still too slow. I cannot do direct lookups because the wild card at the start cannot be used to picking a start in a B-Tree. Therefore the database scans the index.

Let’s approach the issue from another side. What if we first extract all domains from our table, then query for the domain explicitly:

WITH Domains AS (
   SELECT SUBSTRING([Email], CHARINDEX('@',[Email])+1, LEN([Email])) AS Domain
   FROM Guests)
SELECT COUNT(*) FROM Domains
WHERE Domain LIKE 'gamlor.info';

-- SQL Server Execution Times:
--   CPU time = 1891 ms,  elapsed time = 1886 ms.

Well, still needs a scan. Alright, we could make the existing domains a view and then write the query against it. What happens then?

--- Create the View
CREATE VIEW dbo.GuestDomainsView
WITH SCHEMABINDING
AS SELECT
   SUBSTRING([Email], CHARINDEX('@',[Email])+1, LEN([Email])) AS Domain
   FROM dbo.Guests;

--- Query via View
SELECT COUNT(*) FROM GuestDomainsView
WHERE Domain LIKE 'gamlor.info';

-- SQL Server Execution Times:
--   CPU time = 1891 ms,  elapsed time = 1886 ms.

The execution time is still terrible. While the view looks like a separate table, it is still backed by the query on the original table.

Your Database Happily Indexes Views
Figure 3. Your Database Happily Indexes Views

However, most larger SQL database support materialized views or indexed views, where the view result is stored on disk. SQL Server supports creating indexed views, so let’s use that:

CREATE INDEX GuestDomainsView_Index
   ON dbo.GuestDomainsView(Domain);

-- Error:
-- Msg 1940, Level 16, State 1, Line 19
-- Cannot create index on view 'dbo.GuestDomainsView'. It does not have a unique clustered index.

-- Our domains data isn't unique for the index. So let's make the view unique first:
CREATE VIEW dbo.GuestDomainsView
WITH SCHEMABINDING
AS SELECT
   SUBSTRING([Email], CHARINDEX('@',[Email])+1, LEN([Email])) AS Domain, COUNT_BIG(*) AS DomainCount
   FROM dbo.Guests
   GROUP BY SUBSTRING([Email], CHARINDEX('@',[Email])+1, LEN([Email]));

CREATE UNIQUE CLUSTERED INDEX GuestDomainsView_Index
    ON dbo.GuestDomainsView(Domain);

A naive approach fails on SQL Server, because the index on a view needs to be a unique clustered index. So we change the view to have a list of domains on how many guests use that domain. Then index that and try our query again:

SELECT Domain FROM GuestDomainsView
WHERE Domain LIKE 'gamlor.info';

-- SQL Server Execution Times:
--   CPU time = 2781 ms,  elapsed time = 2804 ms
Index View Still Using the Backing Table
Figure 4. Index View Still Using the Backing Table

Unfortunately, the perforce did not improve. If we look at the query plan, we can see that SQL Server still falls back to the original table. The SQL Server engine still goes to the source table by default, in case you read out other fields. We can fix that with a query NOEXPAND query hint.

SELECT * FROM GuestDomainsView WITH (NOEXPAND)
WHERE Domain LIKE 'gamlor.info';

-- SQL Server Execution Times:
--   CPU time = 0 ms,  elapsed time = 0 ms.
Query Plan Using View Index
Figure 5. Query Plan Using View Index

BAAM: Blazing fast query performance for our Domains.

To sum up: indexed (or materialized in other DBs) Views allows you to index the result of a complex query. There are various limits depending on the SQL Server version, but you often can juggle around the limitations and get a useful index for your use case.

Tags: SQL