Secret SQL Weapon: Indexed (or Materialized) Views
A few months backs I implemented logic which needs to lookup entries in a big existing table, but the lookup wasn’t easy indexable. Here’s a guide on one approach you can take. I’m using SQL Server for this example. Let’s assume we have a table with guests, something like this
CREATE TABLE Guests (
Id BIGINT NOT NULL IDENTITY(1,1) PRIMARY KEY,
Created datetime NOT NULL DEFAULT getutcdate(),
Email VARCHAR(256) NOT NULL,
Name VARCHAR(256) NOT NULL
-- More fields
);
-- Example data
SELECT TOP (3) * FROM Guests
-- Result:
Id Created Email Name
1 2019-10-26 12:58:10.867 corrinne1981@rocketmail.com Corrinne Bordons
2 2019-10-26 12:58:10.877 admin@schriever.us Cythia Schriever
3 2019-10-26 12:58:11.177 darcie@tutanota.com Darcie Potencio
When we insert a new guest, we want to check how many guests are using the same email domain. OK, we can do this simple query:
-- A new guest with roman.stoffel@gamlor.info shows up, let's check how many @gamlor.info guests already exist.
SELECT COUNT(*) FROM guests
WHERE Email LIKE '%@gamlor.info'
-- SQL Server Execution Times:
-- CPU time = 2484 ms, elapsed time = 2838 ms.
The big issue: the query runs very slow =(. Let’s inspect the query plan:
Ah, it scans over the table. Let’s try to add an index to the Email column and see the result:
-- Add an index to the email
CREATE INDEX Guests_Email_Index ON Guests(Email);
-- The query got a bit faster, depending on the size of the overall table
SELECT COUNT(Email) FROM Guests
WHERE Email LIKE '%@gamlor.info'
SQL Server Execution Times:
CPU time = 1921 ms, elapsed time = 1920 ms.
The issue persists, the query still needs to scan. It scans the specifi Email index, but that is still too slow. I cannot do direct lookups because the wild card at the start cannot be used to picking a start in a B-Tree. Therefore the database scans the index.
Let’s approach the issue from another side. What if we first extract all domains from our table, then query for the domain explicitly:
WITH Domains AS (
SELECT SUBSTRING([Email], CHARINDEX('@',[Email])+1, LEN([Email])) AS Domain
FROM Guests)
SELECT COUNT(*) FROM Domains
WHERE Domain LIKE 'gamlor.info';
-- SQL Server Execution Times:
-- CPU time = 1891 ms, elapsed time = 1886 ms.
Well, still needs a scan. Alright, we could make the existing domains a view and then write the query against it. What happens then?
--- Create the View
CREATE VIEW dbo.GuestDomainsView
WITH SCHEMABINDING
AS SELECT
SUBSTRING([Email], CHARINDEX('@',[Email])+1, LEN([Email])) AS Domain
FROM dbo.Guests;
--- Query via View
SELECT COUNT(*) FROM GuestDomainsView
WHERE Domain LIKE 'gamlor.info';
-- SQL Server Execution Times:
-- CPU time = 1891 ms, elapsed time = 1886 ms.
The execution time is still terrible. While the view looks like a separate table, it is still backed by the query on the original table.
However, most larger SQL database support materialized views or indexed views, where the view result is stored on disk. SQL Server supports creating indexed views, so let’s use that:
CREATE INDEX GuestDomainsView_Index
ON dbo.GuestDomainsView(Domain);
-- Error:
-- Msg 1940, Level 16, State 1, Line 19
-- Cannot create index on view 'dbo.GuestDomainsView'. It does not have a unique clustered index.
-- Our domains data isn't unique for the index. So let's make the view unique first:
CREATE VIEW dbo.GuestDomainsView
WITH SCHEMABINDING
AS SELECT
SUBSTRING([Email], CHARINDEX('@',[Email])+1, LEN([Email])) AS Domain, COUNT_BIG(*) AS DomainCount
FROM dbo.Guests
GROUP BY SUBSTRING([Email], CHARINDEX('@',[Email])+1, LEN([Email]));
CREATE UNIQUE CLUSTERED INDEX GuestDomainsView_Index
ON dbo.GuestDomainsView(Domain);
A naive approach fails on SQL Server, because the index on a view needs to be a unique clustered
index. So we change
the view to have a list of domains on how many guests use that domain. Then index that and try our query again:
SELECT Domain FROM GuestDomainsView
WHERE Domain LIKE 'gamlor.info';
-- SQL Server Execution Times:
-- CPU time = 2781 ms, elapsed time = 2804 ms
Unfortunately, the perforce did not improve. If we look at the query plan, we can see that SQL Server still falls back to the
original table. The SQL Server engine still goes to the source table by default, in case you read out
other fields. We can fix that with a query NOEXPAND
query hint.
SELECT * FROM GuestDomainsView WITH (NOEXPAND)
WHERE Domain LIKE 'gamlor.info';
-- SQL Server Execution Times:
-- CPU time = 0 ms, elapsed time = 0 ms.
BAAM: Blazing fast query performance for our Domains.
To sum up: indexed (or materialized in other DBs) Views allows you to index the result of a complex query. There are various limits depending on the SQL Server version, but you often can juggle around the limitations and get a useful index for your use case.