Abstract
The rate at which applications gather geospatial data today has turned data loading into a critical component of data analysis pipelines. However, users face multiple file formats for storing geospatial data and an array of systems for processing it. To shed light on how the choice of file format and system affects performance, this paper examines the performance of loading geospatial data stored in diverse file formats using different libraries. It studies the impact of the file format, compares loading throughput across spatial libraries, and examines the micro-architectural behavior of geospatial data loading. Our findings show that GeoParquet files provide the highest loading throughput across all benchmarked libraries. Furthermore, we observe that the more spatial features per byte a file format can store, the higher the data loading throughput. Our micro-architectural analysis reveals high instructions per cycle (IPC) during spatial data loading for most libraries and formats. Additionally, our experiments show that instruction misses dominate L1 cache misses, except for GeoParquet files, where data misses dominate.
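The three quantities the abstract refers to (loading throughput, spatial features per byte, and IPC) can be illustrated with a minimal sketch. The record type, its field names, and the choice of features per second as the throughput unit are assumptions made for illustration; the abstract does not state how the paper defines or measures them.

```python
from dataclasses import dataclass


@dataclass
class LoadMeasurement:
    """One hypothetical benchmark run; all field names are illustrative."""
    features_loaded: int    # spatial features returned by the loader
    file_size_bytes: int    # on-disk size of the input file
    elapsed_seconds: float  # wall-clock time of the load call
    instructions: int       # retired instructions from a hardware counter
    cycles: int             # CPU cycles over the same interval


def throughput_features_per_second(m: LoadMeasurement) -> float:
    # Assumed unit: features per second (the paper may use bytes per second).
    return m.features_loaded / m.elapsed_seconds


def features_per_byte(m: LoadMeasurement) -> float:
    # Storage density: how many spatial features each stored byte encodes.
    return m.features_loaded / m.file_size_bytes


def instructions_per_cycle(m: LoadMeasurement) -> float:
    # IPC: retired instructions divided by elapsed CPU cycles.
    return m.instructions / m.cycles
```

Under these assumptions, comparing `features_per_byte` against `throughput_features_per_second` across formats is one way to check the density trend the abstract reports.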
Original language | English |
---|---|
Title of host publication | DBTest '24: Proceedings of the Tenth International Workshop on Testing Database Systems |
Number of pages | 7 |
Publisher | Association for Computing Machinery |
Publication date | 9 Jun 2024 |
Pages | 36-42 |
ISBN (Electronic) | 9798400706691 |
Publication status | Published - 9 Jun 2024 |
Keywords
- spatial libraries
- benchmarking
- micro-architectural analysis
- database performance evaluation
- geographic information systems