Fsion - 1. Size

PLEASE NOTE SOURCE CODE IS NO LONGER AVAILABLE AND THE POST IS HERE JUST FOR INFORMATION

Previously we introduced Fsion - 0. Introduction, now we will explore database size.

The sample data set is around 10 months of daily position files for the 280 iShares funds available online. All fields in the files are loaded apart from any that can be derived.

Funds

280

Days

206

Position files

71,261

Size unzipped

4.4 GB

Size .zip normal

1.1 GB

Size .7z ultra

199 MB

DataSeries Compression

DataSeries is an immutable collection representing an ordered table of Reporting Date, Transaction Id and int64 encoded Values with the latest values at the top.

This is encoded as a byte array. The first row is stored as varints. Each subsequent row is stored as a difference to the above field values as varints.

With sensible encoding using offsets and the fact that values tend to be close to zero, even single row DataSeries are several times smaller than a more standard representation. Since the table is ordered, and the values in each row are very likely to be close to the ones above, very high compression ratios are possible.

Data Details

Text values are stored in a SetSlim<Text> collection, numeric values are encoded directly to int64. The DataSeries are stored in a MapSlim<EntityAttribute,DataSeries>.

Below is a table of count and number of bytes by entity type (column) and attribute (row):

Text: Count = 59,099 Max length = 50

Count
Bytes

transaction

entitytype

attribute

instrument

position

uri

0
0

0
0

0
0

0
0

0
0

name

0
0

5
15

20
60

38,036
279,391

0
0

time

71,262
783,876

0
0

0
0

0
0

0
0

attribute_type

0
0

0
0

20
60

0
0

0
0

attribute_isset

0
0

0
0

0
0

0
0

0
0

isin

0
0

0
0

0
0

36,476
273,162

0
0

ticker

0
0

0
0

0
0

38,036
275,050

0
0

currency

0
0

0
0

0
0

38,036
240,802

0
0

assetclass

0
0

0
0

0
0

38,023
261,032

0
0

sector

0
0

0
0

0
0

37,927
258,075

0
0

exchange

0
0

0
0

0
0

12,046
79,467

0
0

country

0
0

0
0

0
0

38,036
267,549

0
0

coupon

0
0

0
0

0
0

25,552
193,815

0
0

maturity

0
0

0
0

0
0

25,546
205,849

0
0

price

0
0

0
0

0
0

38,036
15,846,249

0
0

duration

0
0

0
0

0
0

25,552
5,740,430

0
0

fund

0
0

0
0

0
0

0
0

147,323
1,184,564

instrument

0
0

0
0

0
0

0
0

147,323
1,096,271

nominal

0
0

0
0

0
0

0
0

147,323
8,306,493

Size Estimates

Memory 32-bit: Text = 2.9 MB Data = 61.3 MB Total = 64.2 MB
Memory 64-bit: Text = 3.8 MB Data = 75.1 MB Total = 78.9 MB
Size on disk = 40.3 MB

Extrapolating these curves to 10 years of files would give total memory usage of around 650 MB. The files contain the key changing attributes. A database with a number of additional attributes would be expected to comfortably fit in 1 to 5 GB.

Conclusion

The database file size is small enough to store in github and can be used going forward for testing and performance benchmarking.

Looking at the size on disk compared to 32-bit and 64-bit in memory estimates shows that the objects and pointers contribute a large amount to the size. This is not surprising since each 32-bit object has an 8 byte header and 16 bytes for 64-bit, plus 4 and 8 bytes for each reference respectively. A whole single row DataSeries in the above table is only around 8 bytes.

If the DataSeries were not in a time series compressed format this object and pointer overhead would be a lot higher. This agrees with what is often found in server caches. Holding and tracking fine grained subsets of the database can actually use a lot of memory.

The data modelled is for one of the largest financial asset managers. Fast to calculate derived data such as profit and returns (which also tend to be costly to store) should be excluded. By doing this and storing only transaction data compressed in memory it is possible to use Fsion for many financial databases.

It also shows the estimates in the Data-First Architecture post are too high as they don't take account of the DataSeries compression that is possible.

Next, we will look at the performance characteristics of this database compared to other options.