Tombstones: An overlooked reason your Spanner query performance degrades (and how to solve it)

Measuring the ghost effect and using time to skip the graveyard

When optimizing query performance in Spanner, a frequently overlooked performance factor is the presence of “tombstones” which are deletion markers generated by Multi-Version Concurrency Control (MVCC) systems.

In Google’s Spanner, a DELETE operation does not immediately reclaim storage space. Instead, it writes a tombstone to preserve transaction history until background compaction processes sweep through. These tombstone markers can degrade scan performance in especially certain high churn workloads if not managed correctly.

This article explores the mechanics of tombstones, demonstrates their impact on latency using a controlled experiment and discusses mitigation strategies.

The Underlying Mechanism: MVCC and LSM Trees

Spanner, like many modern databases, utilizes Multi-Version Concurrency Control (MVCC) backed by a Log-Structured Merge (LSM) tree storage engine.

In this architecture, a DELETE is functionally a write operation.

When a record is deleted, Spanner does not immediately reclaim storage space or modify existing data files. Instead, it appends a new marker, a tombstone, indicating that the record is invalid as of a specific timestamp.

The same applies to UPDATEs on indexes which are effectively translated to a DELETE — a versioned marker — followed by an INSERT.

This design is necessary for generally speaking two reasons:

Snapshot Isolation: Concurrent transactions running AS OF SYSTEM TIME in the past must still be able to read the data as it existed before the deletion.
Immutable Storage: Spanner utilizes a Log-Structured Merge (LSM) tree architecture. Data files are immutable which means modifications are handled by appending new versions rather than overwriting existing blocks.

The Performance Implication: During a read operation, the query engine cannot simply skip these deleted rows. It must scan the tombstone to verify that the underlying data is no longer visible at the current read timestamp. This process consumes CPU and I/O resources, meaning that scanning a “deleted” row costs roughly the same as scanning a “live” row until background compaction eventually removes it.

The Showcase Experiment

To quantify the overhead of tombstone scans, we can construct a scenario using a “Status” queue pattern which is a pattern where rows are frequently inserted and deleted from a specific index range.

1. Schema Configuration

We utilize a table with a UUID primary key to ensure uniform distribution in the base table. However, we force the query to scan a secondary index on Status, where the data is contiguous.

Note: The STORING clause is used to create a covering index, ensuring that rows_scanned metrics reflect only the index read operations, isolating the variable we wish to measure.

CREATE TABLE Tasks (
  TaskId STRING(36) NOT NULL DEFAULT (GENERATE_UUID()),
  TaskName STRING(100),
  Status STRING(20),
  CreatedAt TIMESTAMP,
) PRIMARY KEY (TaskId);


CREATE INDEX TasksByStatus ON Tasks(Status) STORING (TaskName);

2. Baseline Data Generation

We insert 5,000 “ACTIVE” records to act as noise. We then insert 10 “COMPLETED” records as our test target.

-- Insert Noise (5,000 Active Records)
INSERT INTO Tasks (TaskName, Status, CreatedAt)
SELECT CONCAT('Active Task ', CAST(x AS STRING)), 'ACTIVE', CURRENT_TIMESTAMP()
FROM UNNEST(GENERATE_ARRAY(1, 5000)) AS x;


-- Insert Target (10 Completed Records)
INSERT INTO Tasks (TaskName, Status, CreatedAt)
SELECT CONCAT('Completed Task ', CAST(x AS STRING)), 'COMPLETED', CURRENT_TIMESTAMP()
FROM UNNEST(GENERATE_ARRAY(1, 10)) AS x;

3. Execution Analysis

Phase A: The Clean Read

`We query the index for completed tasks.

SELECT TaskName 
FROM Tasks@{FORCE_INDEX=TasksByStatus} 
WHERE Status = 'COMPLETED';

Performance Result:

Rows Returned        = 10
Deleted rows scanned = 0
Rows Scanned         = 10

The engine reads 10 live rows and returns them.

Phase B: The Delete Effect

We delete the target records. This writes 10 tombstones to the index.

DELETE FROM Tasks WHERE Status = 'COMPLETED';

SELECT TaskName 
FROM Tasks@{FORCE_INDEX=TasksByStatus} 
WHERE Status = 'COMPLETED';

Performance Result:

Rows Returned        = 0
Deleted rows scanned = 10
Rows Scanned         = 0

Despite returning an empty result set, the latency profile remains similar to Phase A. The engine was required to process 10 index entries to determine they were effectively deleted relative to the transaction timestamp.

Phase C: Accumulation

We simulate a workload where new data replaces the old. We insert 10 new completed records.

INSERT INTO Tasks (TaskName, Status, CreatedAt) 
SELECT CONCAT('Completed Task ', CAST(x AS STRING)), 
'COMPLETED', CURRENT_TIMESTAMP() FROM UNNEST(GENERATE_ARRAY(1, 10)) AS x;

SELECT TaskName
FROM Tasks@{FORCE_INDEX=TasksByStatus}
WHERE Status = 'COMPLETED';

Performance Result:

Performance Result:
Rows Returned        = 10
Deleted rows scanned = 10
Rows Scanned         = 10

The scan must now process the 10 new live records plus the 10 tombstones from the previous deletion.

Phase D: Compounding Overhead

We perform another cycle of deletion and insertion.

DELETE FROM Tasks WHERE Status = 'COMPLETED';

INSERT INTO Tasks (TaskName, Status, CreatedAt) 
SELECT CONCAT('Completed Task ', CAST(x AS STRING)), 
'COMPLETED', CURRENT_TIMESTAMP() FROM UNNEST(GENERATE_ARRAY(1, 10)) AS x;

SELECT TaskName
FROM Tasks@{FORCE_INDEX=TasksByStatus}
WHERE Status = 'COMPLETED';

Performance Result:

Rows Returned        = 10
Deleted rows scanned = 20
Rows Scanned         = 10

The scan overhead has tripled compared to Phase A, despite the result set size remaining constant.

Implications and Solutions

This accumulation of tombstones results in “Scan Amplification”. In such “high-churn” workload patterns, this can lead to queries that appear efficient (returning a few rows) but suffer from high latency due to processing large numbers of deleted versions of records.

If we cannot prevent the workload from creating tombstones, we must prevent the query engine from reading them as much as possible.

A solution for high-churn tombstone causing workload patterns in Spanner is to utilize the index to perform a range seek rather than a full range scan. By instructing the query to start reading at a specific timestamp, we effectively tell the engine to “jump over” the data blocks containing the deleted history.

We require an index that sorts primarily by our status, but secondarily by time.

-- Schema
CREATE TABLE Tasks (
 TaskId STRING(36) NOT NULL DEFAULT (GENERATE_UUID()),
 TaskName STRING(100),
 Status STRING(20),
 CreatedAt TIMESTAMP,
) PRIMARY KEY (TaskId);
CREATE INDEX TasksByStatus ON Tasks(Status, CreatedAt) STORING (TaskName);

We populate the table with active “noise” and a batch of completed tasks, then immediately delete the completed tasks to generate tombstones.

-- 1. Insert 5,000 'ACTIVE' rows (The Noise)
INSERT INTO Tasks (TaskName, Status, CreatedAt)
SELECT CONCAT('Active Task ', CAST(x AS STRING)), 'ACTIVE', CURRENT_TIMESTAMP()
FROM UNNEST(GENERATE_ARRAY(1, 5000)) AS x;

-- 2. Insert 10 'COMPLETED' rows
INSERT INTO Tasks (TaskName, Status, CreatedAt)
SELECT CONCAT('Completed Task ', CAST(x AS STRING)), 'COMPLETED', CURRENT_TIMESTAMP()
FROM UNNEST(GENERATE_ARRAY(1, 10)) AS x;

We note the timestamp of the last deletion operation. This timestamp acts as your “cursor” or “offset”:

-- 3. Delete the completed tasks (Creates 10 Tombstones)
DELETE FROM Tasks WHERE Status = 'COMPLETED';

Now we insert again a new batch of tasks:

-- 4. Insert new batch and capture the timestamp
INSERT INTO Tasks (TaskName, Status, CreatedAt)
SELECT
 CONCAT('Completed Task ', CAST(x AS STRING)),
 'COMPLETED',
 CURRENT_TIMESTAMP()
FROM UNNEST(GENERATE_ARRAY(1, 10)) AS x;

Scenario A — The Naive Query: If we query simply by Status, Spanner scans from the beginning of the ‘COMPLETED’ range. It must process the 10 tombstones from Step 3 before finding the 10 new records from Step 4.

SELECT TaskName
FROM Tasks@{FORCE_INDEX=TasksByStatus}
WHERE Status = 'COMPLETED';

Performance Impact:

Rows Returned: 10
Deleted rows scanned: 10 (The tombstone penalty)
Rows Scanned: 20 (Total work)

Scenario B — The Timestamp Offset (Solution): By appending the time constraint, we force the engine to seek directly near the position where the new data begins.

SELECT TaskName
FROM Tasks@{FORCE_INDEX=TasksByStatus}
WHERE Status = 'COMPLETED' AND CreatedAt >= "2025-11-26T08:56:57.41295088Z";

Performance Impact:

Rows Returned: 10
Deleted rows scanned: 0
Rows Scanned: 10 (Total work)

Tombstones: An overlooked reason your Spanner query performance degrades (and how to solve it) was originally published in Google Cloud – Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source Credit: https://medium.com/google-cloud/tombstones-an-overlooked-reason-your-spanner-query-performance-degrades-and-how-to-solve-it-de3f787698fc?source=rss—-e52cf94d98af—4