Performance tips / what is normal?

I know this is an OLD thread, but there is at least a colourable argument that this is actually a database design flaw/bug, and I am going to raise a bug report for it: frequent, inefficient query on "Block" table is very slow on large databases · Issue #3884 · duplicati/duplicati · GitHub

I have a suggested workaround too, see below.

The query originates from LocalDatabase.cs

    public long GetBlocksLargerThan(long fhblocksize)
    {
        using(var cmd = m_connection.CreateCommand())
            return cmd.ExecuteScalarInt64(@"SELECT COUNT(*) FROM ""Block"" WHERE ""Size"" > ?", -1, fhblocksize);
    }

This function is in turn called from several places, such as simply running a backup, or compacting the database. Unfortunately, I’m not a dev, so I’m having a little trouble following the source code, but it seems to usually return zero and serve as a sanity check.

While it’s true that disk IO is the bottleneck, an important latent question is: how is SQLite implementing this query in such a manner as to cause the disk IO issues?

The slowdown is caused by the fact that the “Size” column is not indexed, so the planner is left choosing an index with poor selectivity for this predicate:

SELECT COUNT(1) 
  FROM "Block" 
 WHERE "Size" > 102400;

/* Result: 1 rows returned in 1528ms
At line 1:
SELECT COUNT(1) FROM "Block" WHERE "Size" > 102400; */

EXPLAIN QUERY PLAN 
SELECT COUNT(1) 
  FROM "Block" 
 WHERE "Size" > 102400;

  id   parent   notused                         detail                         
 ---- -------- --------- ----------------------------------------------------- 
   3        0         0   SCAN TABLE Block USING COVERING INDEX BlockHashSize  

This choice is actually correct: the planner has no better alternative. It has no resources available to it to ascertain the location of matching rows, so it scans the whole thing!
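To see exactly what the planner had to choose from, you can list the table’s indexes from the shell. This uses a standard SQLite pragma; the exact set of indexes will vary with your Duplicati version:

PRAGMA index_list("Block");
-- expect BlockHashSize to be listed (flagged unique);
-- note there is no index whose leading column is "Size"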

This table has (in my case; much smaller than yours) 1,727,853 rows, but only 50,891 distinct “Size” values! The wasted effort is massive because, in fact, I have zero matching rows. So the planner gathers a huge amount of data only to find it must throw it all away!

SELECT 'total' as DETAIL, COUNT(1) as ROWCOUNT
  FROM "Block"
  UNION ALL SELECT 'distinct', count(DISTINCT "Size")
              FROM "Block" ;

   DETAIL    ROWCOUNT  
 ---------- ---------- 
  total       1727853  
  distinct      50891  

The job of BlockHashSize is to ensure that the combination of the Hash and Size columns on this table is unique. It is not well equipped to answer a range predicate on “Size” alone, and a professional database like Postgres or Oracle will also perform poorly on this.
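You can confirm the column order that makes BlockHashSize so ill-suited to this predicate. This is a sketch; I’m assuming from the name that the index is declared over Hash first, then Size:

PRAGMA index_info("BlockHashSize");
-- expected: seqno 0 = Hash, seqno 1 = Size
-- with "Hash" leading, a range predicate on "Size" alone cannot seek into
-- the index; it can only scan all of it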

For proof: if we recreate identical DDL in Postgres and import the data, we can compare plans. (This DB happens to already have an unrelated block table, so the output below shows duplicati_block instead.)

EXPLAIN (VERBOSE, COSTS, ANALYZE) 
SELECT count (1) -- Note: PG evaluates (*) differently from (1)
  FROM "duplicati_block" 
 WHERE "Size" > 102400;                           

QUERY PLAN                                                                                                                                        
--------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate  (cost=29634.47..29634.48 rows=1 width=8) (actual time=577.102..577.103 rows=1 loops=1)                                        
  Output: count(1)                                                                                                                                
  ->  Gather  (cost=29634.26..29634.47 rows=2 width=8) (actual time=576.687..613.457 rows=3 loops=1)                                              
        Output: (PARTIAL count(1))                                                                                                                
        Workers Planned: 2                                                                                                                        
        Workers Launched: 2                                                                                                                       
        ->  Partial Aggregate  (cost=28634.26..28634.27 rows=1 width=8) (actual time=440.321..440.322 rows=1 loops=3)                             
              Output: PARTIAL count(1)                                                                                                            
              Worker 0: actual time=373.249..373.250 rows=1 loops=1                                                                               
              Worker 1: actual time=373.232..373.232 rows=1 loops=1                                                                               
              ->  Parallel Seq Scan on public.duplicati_block  (cost=0.00..28634.23 rows=10 width=0) (actual time=440.303..440.304 rows=0 loops=3)
                    Output: "ID", "Hash", "Size", "VolumeID"                                                                                      
                    Filter: (duplicati_block."Size" > 102400)                                                                                     
                    Rows Removed by Filter: 575951                                                                                                
                    Worker 0: actual time=373.235..373.235 rows=0 loops=1                                                                         
                    Worker 1: actual time=373.208..373.208 rows=0 loops=1                                                                         
Planning Time: 993.489 ms                                                                                                                         
Execution Time: 613.816 ms                                                                                                                        

As we can see, the PG planner (which is vastly more sophisticated) chose an essentially structurally equivalent execution plan. The only real advantage it has is that Postgres is capable of parallelizing the scan (not all queries are parallelizable, and SQLite does not have this feature at all, to my understanding).
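For a fairer comparison, you can take the parallelism out of the picture for the session and re-run the plan; this is a sketch using a standard Postgres setting:

SET max_parallel_workers_per_gather = 0;  -- disable parallel workers for this session

EXPLAIN (VERBOSE, COSTS, ANALYZE)
SELECT count(1)
  FROM "duplicati_block"
 WHERE "Size" > 102400;
-- expect a plain (non-parallel) Seq Scan, i.e. the same shape as SQLite's full scan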

Possible one-off fix:

create index BlockCountSizeIdx on "Block"("Size");

SELECT COUNT(1) 
  FROM "Block" 
 WHERE "Size" > 102400;

/* Result: 1 rows returned in 9ms
At line 17:
SELECT COUNT(1) FROM "Block" WHERE "Size" > 102400; */

EXPLAIN QUERY PLAN 
SELECT COUNT(1) 
  FROM "Block" 
 WHERE "Size" > 102400;

  id   parent   notused                                 detail                                
 ---- -------- --------- -------------------------------------------------------------------- 
   3        0         0   SEARCH TABLE Block USING COVERING INDEX BlockCountSizeIdx (Size>?)  

Notice in the ‘detail’ column how we’ve switched from “SCAN TABLE”, which is SQLite-talk for the horrible, expensive solution of walking through the entire table (AKA “Sequential Scan” for Postgres), to the vastly more efficient “SEARCH TABLE” with an index predicate. I’m not as familiar with the output format of the SQLite “EXPLAINer”, but “COVERING INDEX” tells me we’re looking at the equivalent of an “Index Only Scan”: the planner realized it can answer this question solely by examining the index, so it wouldn’t have fetched any table pages at all, and hence hardly any disk IO.
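If you want to reproduce the timings yourself, the sqlite3 command-line shell can print per-statement wall-clock time. These are standard shell dot-commands, not SQL:

.timer on
SELECT COUNT(1) FROM "Block" WHERE "Size" > 102400;
-- the shell prints a "Run Time: real ... user ... sys ..." line after the result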

And FWIW, this is Postgres’ take on this optimization:

QUERY PLAN                                                                                                                                         
---------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=8.01..8.02 rows=1 width=8) (actual time=0.054..0.055 rows=1 loops=1)                                                              
  Output: count(1)                                                                                                                                 
  ->  Index Only Scan using blockcountsizeidx on public.duplicati_block  (cost=0.43..8.01 rows=1 width=0) (actual time=0.036..0.036 rows=0 loops=1)
        Output: "Size"                                                                                                                             
        Index Cond: (duplicati_block."Size" > 102400)                                                                                              
        Heap Fetches: 0                                                                                                                            
Planning Time: 1.229 ms                                                                                                                            
Execution Time: 0.282 ms                                                                                                                           

As before, it chose a very similar plan.

As was mentioned, there are a lot of postings here and there about the “SQLite performance problem”, but to be very honest, I’m not positive whether there really is an SQLite performance problem in the sense of hitting some theoretical limit of SQLite (despite the inefficient way writes work), or whether there is, as here, a DDL or query-design problem.

But SQLite is not instrumented for performance tracing the way a professional database is. It has no session history, no system catalogue views to inspect locks and blocking, and no slow query log. This makes it difficult, bordering on impossible, to offer intelligent performance analyses.
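About the closest the stock sqlite3 shell gets is a pair of switches that dump per-statement statistics; take this as a sketch, since .scanstats only works in builds compiled with SQLITE_ENABLE_STMT_SCANSTATUS:

.stats on      -- memory/IO statistics after each statement
.scanstats on  -- per-loop row counts, if the build supports it
SELECT COUNT(1) FROM "Block" WHERE "Size" > 102400;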

On a 955 megabyte database, adding that index increased the file size by only 25 megabytes for a performance gain of a couple hundred times (1528ms vs 9ms), so this seems like a viable solution. Maybe the system could monitor the timings and create the index only when the query runs for longer than, say, a minute.
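To measure what such an index costs on your own database, the dbstat virtual table can report its on-disk size. A sketch, since dbstat is only present in builds compiled with SQLITE_ENABLE_DBSTAT_VTAB (the standard sqlite3 shell has it):

SELECT SUM("pgsize") AS index_bytes
  FROM "dbstat"
 WHERE "name" = 'BlockCountSizeIdx';
-- should roughly match the ~25 MB overhead quoted above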
