Why might a table's data space take up 4x the size of the raw data?

by Jon of All Trades   Last Updated August 14, 2019 13:06

I have a table with 490 M rows and 55 GB of table space, so about 167 bytes per row. The table has three columns: a VARCHAR(100), a DATETIME2(0), and a SMALLINT. The average length of the text in the VARCHAR field is about 21.5, so the raw data should be around 32 bytes per row: 22+2 for the VARCHAR, 6 for the DATETIME2, and 2 for the 16-bit integer.

Note that the space above is data only, not indices. I'm using the value reported under Properties | Storage | General | Data space.

Of course there must be some overhead, but 135 bytes per row seems like a lot, especially for a large table. Why might this be? Has anyone else seen similar multipliers? What factors can influence the amount of extra space required?

For comparison, I tried creating a table with two INT fields and 1 M rows. The data space required was 16.4 MB: 17 bytes per row, compared to 8 bytes of raw data. Another test table with an INT and a VARCHAR(100) populated with the same text as the real table uses 39 bytes per row (44 K rows), where I would expect 28 plus a little.

So the production table has considerably more overhead. Is this because it's larger? I'd expect index sizes to grow roughly as N * log(N), but I don't see why the space required for the actual data would be non-linear.

Thanks in advance for any pointers!

EDIT:

All of the fields listed are NOT NULL. The real table has a clustered PK on the VARCHAR field and the DATETIME2 field, in that order. For the two tests, the first INT was the (clustered) PK.

If it matters: the table is a record of ping results. The fields are URL, ping date/time, and latency in milliseconds. Data is constantly appended, and never updated, but data is deleted periodically to cut it down to just a few records per hour per URL.

EDIT:

A very interesting answer here suggests that, for an index with much reading and writing, rebuilding may not be beneficial. In my case, the space consumed is a concern, but if write performance is more important, one may be better off with flabby indices.



3 Answers


Have the data types changed over time? Have variable-length columns been removed? Have the indexes been defragmented often but never rebuilt? Have a lot of rows been deleted or have a lot of variable-length columns been updated significantly? Some good discussion here.

Aaron Bertrand
September 06, 2011 21:08

The on-disk structures have overhead:

  • row header
  • null bitmap + pointer
  • variable length column offsets
  • row version pointers (optional)
  • ...

Taking two 4-byte int columns, you have:

  • 4-byte row header
  • 2-byte pointer to the NULL bitmap
  • 8 bytes for the two int columns
  • 3-byte NULL bitmap

Wow 17 bytes!

You can do the same for your second test table, which carries the extra variable-length overhead your original table has:

  • 2 bytes for the count of variable-length columns
  • 2 bytes per variable-length column (the offset array)

For the INT + VARCHAR(100) test table that gives: 4 (row header) + 2 (NULL bitmap pointer) + 4 (int) + 3 (NULL bitmap) + 2 (variable-column count) + 2 (offset) + ~21.5 (average string length) ≈ 38.5 bytes, matching the 39 bytes per row you measured.

Why the difference for the production table? In addition (I won't link to all of these):

  • have you ever rebuilt indexes to defragment them?
  • deletes do not reclaim space
  • data pages will split if you insert into the middle
  • updates may cause forward pointers (leaves a gap)
  • row overflow
  • removed varchar column without index rebuild or DBCC CLEANTABLE
  • heap vs. clustered table (a heap has no clustered index, so records end up scattered all over)
  • RCSI isolation level (extra 14 bytes per row)
  • trailing spaces in varchar (SET ANSI_PADDING is ON by default): use DATALENGTH to check, not LEN
  • Run sp_spaceused with @updateusage = 'true'
  • ...

See this: SQL Server: How to create a table that fills one 8 KB page?


gbn
September 07, 2011 05:02 AM

After discussion in the comments on the original question, it appears that in this case the lost space is caused by the choice of clustered key, which has led to massive fragmentation.

Always worth checking the state of fragmentation via sys.dm_db_index_physical_stats in these situations.

Edit: Following update in comments

The average page density (prior to rebuilding the clustered index) was 24%, which fits the original question perfectly: the pages were only about a quarter full, so the total size was roughly 1/0.24 ≈ 4x the raw data size.

Mark Storey-Smith
September 07, 2011 21:20
