Fsion - 1. Size

17 May 2019

PLEASE NOTE SOURCE CODE IS NO LONGER AVAILABLE AND THE POST IS HERE JUST FOR INFORMATION

Previously we introduced Fsion - 0. Introduction, now we will explore database size.

The sample data set is around 10 months of daily position files for the 280 iShares funds available online. All fields in the files are loaded apart from any that can be derived.

Funds	280
Days	206
Position files	71,261
Size unzipped	4.4 GB
Size .zip normal	1.1 GB
Size .7z ultra	199 MB

DataSeries Compression

DataSeries is an immutable collection representing an ordered table of Reporting Date, Transaction Id and int64 encoded Values with the latest values at the top.

This is encoded as a byte array. The first row is stored as varints. Each subsequent row is stored as a difference to the above field values as varints.

With sensible encoding using offsets and the fact that values tend to be close to zero, even single row DataSeries are several times smaller than a more standard representation. Since the table is ordered, and the values in each row are very likely to be close to the ones above, very high compression ratios are possible.

Data Details

Text values are stored in a SetSlim<Text> collection, numeric values are encoded directly to int64. The DataSeries are stored in a MapSlim<EntityAttribute,DataSeries>.

Below is a table of count and number of bytes by entity type (column) and attribute (row):

Text: Count = 59,099 Max length = 50

Count Bytes	transaction	entitytype	attribute	instrument	position
uri	0 0	0 0	0 0	0 0	0 0
name	0 0	5 15	20 60	38,036 279,391	0 0
time	71,262 783,876	0 0	0 0	0 0	0 0
attribute_type	0 0	0 0	20 60	0 0	0 0
attribute_isset	0 0	0 0	0 0	0 0	0 0
isin	0 0	0 0	0 0	36,476 273,162	0 0
ticker	0 0	0 0	0 0	38,036 275,050	0 0
currency	0 0	0 0	0 0	38,036 240,802	0 0
assetclass	0 0	0 0	0 0	38,023 261,032	0 0
sector	0 0	0 0	0 0	37,927 258,075	0 0
exchange	0 0	0 0	0 0	12,046 79,467	0 0
country	0 0	0 0	0 0	38,036 267,549	0 0
coupon	0 0	0 0	0 0	25,552 193,815	0 0
maturity	0 0	0 0	0 0	25,546 205,849	0 0
price	0 0	0 0	0 0	38,036 15,846,249	0 0
duration	0 0	0 0	0 0	25,552 5,740,430	0 0
fund	0 0	0 0	0 0	0 0	147,323 1,184,564
instrument	0 0	0 0	0 0	0 0	147,323 1,096,271
nominal	0 0	0 0	0 0	0 0	147,323 8,306,493

Size Estimates

Memory 32-bit: Text = 2.9 MB Data = 61.3 MB Total = 64.2 MB
Memory 64-bit: Text = 3.8 MB Data = 75.1 MB Total = 78.9 MB
Size on disk = 40.3 MB

Extrapolating these curves to 10 years of files would give total memory usage of around 650 MB. The files contain the key changing attributes. A database with a number of additional attributes would be expected to comfortably fit in 1 to 5 GB.

Conclusion

The database file size is small enough to store in github and can be used going forward for testing and performance benchmarking.

Looking at the size on disk compared to 32-bit and 64-bit in memory estimates shows that the objects and pointers contribute a large amount to the size. This is not surprising since each 32-bit object has an 8 byte header and 16 bytes for 64-bit, plus 4 and 8 bytes for each reference respectively. A whole single row DataSeries in the above table is only around 8 bytes.

If the DataSeries were not in a time series compressed format this object and pointer overhead would be a lot higher. This agrees with what is often found in server caches. Holding and tracking fine grained subsets of the database can actually use a lot of memory.

The data modelled is for one of the largest financial asset managers. Fast to calculate derived data such as profit and returns (which also tend to be costly to store) should be excluded. By doing this and storing only transaction data compressed in memory it is possible to use Fsion for many financial databases.

It also shows the estimates in the Data-First Architecture post are too high as they don't take account of the DataSeries compression that is possible.

Next, we will look at the performance characteristics of this database compared to other options.

« Fsion - 0. Introduction Causal Profiling in .NET »

Anthony Lloyd

Fsion - 1. Size

DataSeries Compression

Data Details

Size Estimates

Conclusion