Chapter4: Spatial Storage and Indexing

Chapter4: Spatial Storage and Indexing
4.1 Storage:Disk and Files 4.2 Spatial Indexing 4.3 Trends 4.4 Summary

Learning Objectives Learning Objectives (LO)
LO1: Understand concept of a physical data model What is a physical data model? Why learn about physical data models? LO2: Learn how to efficiently use storage devices LO3: Learn how to structure data files LO4: Learn how to use auxiliary(辅助的) data-structures LO5: Learn about technology trends in physical data model Focus on concepts not procedures! Mapping Sections to learning objectives LO2, LO LO LO

Physical model in 3 level design?
Recall 3 levels of database design Conceptual model: high level abstract description Logical model: description of a concrete realization Physical model: implementation using basic components Analogy with vehicles Conceptual model: mechanisms to move, turn, stop, ... Logical models: Car: accelerator pedal（踏板）, steering wheel, brake pedal, … Bicycle: pedal forward to move, turn handle, pull brakes on handle Physical models : Car: engine, transmission, master cylinder（汽缸）, break lines, brake pads, … Bicycle: chain from pedal to wheels, gears（齿轮）, wire from handle to brake pads（垫）

What is a physical data model?
What is a physical data model of a database? Concepts to implement logical data model Using current components, e.g. computer hardware, operating systems In an efficient and fault-tolerant(容错) manner Why learn physical data model concepts? To be able to choose between DBMS brand names Some brand names do not have spatial indices! To be able to use DBMS facilities for performance tuning For example, If a query is running slow, one may create an index to speed it up For example, if loading of a large number of tuples takes for ever one may drop indices on the table before the inserts and recreate index after inserts are done!

Concepts in a physical data model
Database concepts Conceptual data model - entity, (multi-valued) attributes, relationship, … Logical model - relations, atomic attributes, primary and foreign keys Physical model - secondary storage hardware, file structures, indices, … Examples of physical model concepts from relational DBMS Secondary storage hardware: Disk drives File structures - sorted Auxiliary search structure - search trees (hierarchical collections of one-dimensional ranges)

An interesting fact about physical data model
Physical data model design is a trade-off（平衡） between Efficiently support a small set of basic operations of a few data types Simplicity of overall system Each DBMS physical model Choose a few physical DM techniques Choice depends chosen sets of operations and data types Relational DBMS physical model Data types: numbers, strings, date, currency one-dimensional, totally ordered Operations: search on one-dimensional totally order data types insert, delete, ...

Physical data model for SDBMS
Is relational DBMS physical data model suitable for spatial data? Relational DBMS has simple values like numbers Sorting, search trees are efficient for numbers These concepts are not natural for Spatial data (e.g. points in a plane) Reusing relational physical data model concepts Space filling curves define a total order for points This total order helps in using ordered files, search trees But may lead to computational inefficiency! New spatial techniques Spatial indices, e.g. grids, hierarchical collection of rectangles Provide better computational performance

Common assumptions for SDBMS physical model
Spatial data Dimensionality of space is low, e.g. 2 or 3 Data types: OGIS data types Approximations for extended objects (e.g. linestrings, polygons) Minimum Orthogonal Bounding Rectangle (MOBR or MBR) MBR(O) is the smallest axis-parallel rectangle enclosing an object O Supports filter and refine processing of queries Spatial operations OGIS operations, e.g. topological, spatial analysis Many topological operations are approximated by “Overlap” Common spatial queries - listed in next slide

Common Spatial Queries and Operations
Physical model provides simpler operations needed by spatial queries! Common Queries Point query: Find all rectangles containing a given point. Range query（范围查询）: Find all objects within a query rectangle. Nearest neighbor: Find the point closest to a query point. Intersection query: Find all the rectangles intersecting a query rectangle. Common operations across spatial queries find : retrieve records satisfying a condition on attribute(s) Find next : retrieve next record in a dataset with total order after the last one retrieved via previous find or find next Nearest neighbor of a given object in a spatial dataset

Scope of discussion 4.1 Storage:Disk and Files 4.2 Spatial Indexing 4.3 Trends 4.4 Summary Learn basic concepts in physical data model of SDBMS Review related concepts from physical DM of relational DBMS Reusing relational physical data model concepts Space filling curves define a total order for points This total order helps in using ordered files, search trees But may lead to computational inefficiency! New techniques Spatial indices, e.g. grids, hierarchical collection of rectangles Provide better computational performance

LO1: Understand concept of a physical data model LO2 : Learn how to efficiently use storage devices Concepts in Storage Hierarchy Characteristics of secondary storage Using secondary storage efficiently LO3: Learn how to structure data files LO4: Learn how to use auxiliary data-structures LO5: Learn about technology trends in physical data model Mapping Sections to learning objectives LO2, LO (4.1.1) LO LO

4.1 Storage:Disk and Files-- Storage Hierarchy in Computers
Computers have several components Central Processing Unit (CPU) Input, output devices, e.g. mouse, keyword, monitors, printers Communication mechanisms, e.g. internal bus（内部总线）, network card, modem Storage Hierarchy Types of storage Devices Main memories - fast but content is lost when power is off Secondary storage - slower, retains content without power Tertiary storage （如磁带驱动器）- very slow, retains content, very large capacity DBMS usually manage data on secondary storage, e.g. disks Use main memory to improve performance User tertiary storage (e.g. tapes) for backup, archival etc.

Measure of efficiency I/O cost: Number of disk sectors retrieved from secondary storage CPU cost: Number of CPU instruction(指令) used See Table 4.1 for relative importance of cost components Total cost = sum of I/O cost and CPU cost

物理存储介质层次速度基本存储容量寄存器高速缓冲存储器主存储器联机存储快闪存储器磁盘存储器
Secondary storage 脱机存储光盘存储器磁带存储器 Tertiary storage

物理存储介质：基本存储寄存器（register） CPU的一部分，用于暂存运算中间结果与运算部件直接连接，速度最快，极少（几十个）
高速缓冲存储器（cache memory） CPU的一部分，用于缓存主存储器在CPU中，速度极快，容量小（几十K~几百K）操作系统底层管理

物理存储介质：基本存储主存储器（main memory）通过总线与CPU相连，存储运算所需的数据和指令
速度很快（纳秒级），一般容量在几十M～几百M 随机访问：访问任何存储单元时间相同易失性：断电丢失操作系统提供机制，应用程序管理

物理存储介质：在线存储快闪存储器（flash memory）磁盘存储器（disk memory）
通过外设接口与总线相连，存储永久保留的数据速度受到存储介质和接口限制随机访问，非易失性，断电不丢失文件系统管理，可以通过操作系统在线访问磁盘存储器（disk memory）同上，但是机械装置，速度更慢

物理存储介质：脱机存储光盘存储器（CDROM/CDR/CDRW/DVD）脱机存储，保存备份或者历史档案分为光物理盘，光化学盘和光磁盘
只读，可写一次，可重复读写机械装置，随机访问，速度更低有标准数据记录格式，操作系统提供文件系统接口访问

物理存储介质：脱机存储磁带存储器（tape）脱机存储，保存备份或者历史档案电磁记录原理机械装置，顺序访问
速度最低，容量价格比最高（至几百G）

4.1.1 Secondary Storage Hardware: Disk Drives
Disk concepts Circular platters with magnetic storage medium （圆形磁盘片） Multiple platters are mounted on a spindle Platters are divided into concentric tracks （同心磁道） A cylinder（柱面） is a collection of tracks across platters with common radium Tracks are divided into sectors（扇区） A sector size may a few kilo-Bytes 扇区就是每一个磁道中被分成若干等分的区域磁道：带有间距的同心圆 0柱面 1柱面一个磁盘块有整数倍个扇区，这个倍数可在磁盘初始化时设定。磁盘块(或简称为页面)是磁盘与主存之间的最小传输单元。

磁盘存储物理特性盘片磁道track 扇区sector 柱面cylinder 磁头驱动臂

Disk drive concepts Disk drive concepts
Disk heads （磁头）to read and write There is disk head for each platter (recording surface) A head assembly moves all the heads together in radial direction Spindle rotates at a high speed, e.g. thousands revolution per minute Accessing a sector has three major steps: Seek（寻道）: Move head assembly to relevant track （ts） Latency(延迟时间): Wait for spindle to rotate relevant sector under disk head（tl） Transfer: Read or write the sector (tt) Other steps involve communication between disk controller and CPU

Using Disk Hardware Efficiently
Disk access cost are affected by Placement of data on the disk Fact than seek cost > latency cost > transfer (See Table 4.2, pp. 86) A few common observations follow Size of sectors Larger sector provide faster transfer of large data sets But waste storage space inside sectors for small data sets Placement of most frequently accessed data items On middle tracks rather than innermost or outermost tracks Reason: minimize average seek time Placement of items in a large data set requiring many sectors Choose sectors from a single cylinder Reason: Minimize seek cost in scanning the entire data set. 虽然传输时间tt在磁盘初始化的时候已经固定，但将数据按照一定策略放置在磁盘上仍然可以在很大程度上减少寻道时间ts和延迟时间t1。

4.1.2 Buffer Management Motivation
Accessing a sector on disk is much slower than accessing main memory Idea: Keep repeatedly accessed data in main memory buffers To improve the completion time of queries Reducing load on disk drive Buffer Manager software module decides Which sectors stay in main memory buffers? Which sector is moved out if we run out of memory buffer space? When to pre-fetch sector before access request from users? These decision are based on the disk access patterns of queries!

虚拟内存当系统运行时，先要将所需的指令和数据从外部存储器（如硬盘、软盘、光盘等）调入内存中，CPU 再从内存中读取指令或数据进行运算，并将运算结果存入内存中，内存所起的作用就像一个“二传手”的作用。当运行一个程序需要大量数据、占用大量内存时，内存这个仓库就会被“塞满”，而在这个“仓库”中总有一部分暂时不用的数据占据着有限的空间，所以要将这部分“惰性”的数据请出，让出地方给“活性”数据使用。这时就需要新建另一个后备“仓库”去存放“惰性”数据。由于硬盘的空间很大，所以微软Windows操作系统就将后备“仓库”的地址选在硬盘上，这个后备“仓库”就是虚拟内存。

4.1.3 Software view of Disks: Fields, Records and File
Views of secondary storage (e.g. disks) Hardware views - discussed in last few slides Software views - Data on disks is organized into fields, records, files Concepts Field presents a property or attribute of a relation or an entity Records represent a row in a relational table Collection of fields for attributes in relational schema of the table Files are collections of records Homogeneous collection of records may represent a relation Heterogeneous（不同类）collections may be a union of related relations.

Mapping Records and files to Disk
Often smaller than a sector Many records in a sector Files with many records Many sectors per file File system Collection of files Organized into directories 扇区

City table takes 2 sectors Others take 1 sector each
COUNTRY表:size(name)+size(cont)+size(pop)+size(GDP)+size(1ife -exp)+size(shape)≤ =80 每个扇区512字节将记录从Country、City和River表映射到磁盘页 , 每个扇区还包含格式信息，用于识别空闲空间、每条记录的开头和可能的域、下一个扇区的地址等。图4-1忽略了实际存储的很多细节，但仍有一些基本信息可以说明在二级存储中如何进行表的物理存储。使用这些不同的参数，就出现了许多针对特定应用来专门组织域、记录和文件的方法。例如，一条记录的域可以是定长或变长的；文件中的记录可以是有序或无序的；文件可以组织成链表或页面目录。每种组织方法都经过了彻底的检验，各有优劣， Mapping tables to disk Figure 4.1 City table takes 2 sectors Others take 1 sector each

相关概念每个磁道都划分成扇区，扇区的大小由驱动器的厂商规定。一个磁盘块有整数倍个扇区，这个倍数可在磁盘初始化时设定。磁盘块(或简称为页面)是磁盘与主存之间的最小传输单元。页的概念是一个很好的抽象，有利于理解数据在不同内存设备之间的移动。而在高层次的交互中，将数据看作用文件、记录和域这种层次结构组织起来的集合能更有效地进行交互。文件是记录的集合，一个文件(可能)跨越多个页面。一个页面是槽(slot)的集合，每个槽包含一条记录，每条记录都是相同或不同类型的域的集合。

LO1: Understand concept of a physical data model LO2 : Learn how to efficiently use storage devices LO3: Learn how to structure data files What is a file structure? Why structure files? What are common structures for spatial data file? LO4: Learn how to use auxiliary data-structures LO5: Learn about technology trends in physical data model Mapping Sections to learning objectives LO2, LO LO LO

4.1.4 File Structures - selected file operations
Common file operations Find: key value --> record matching key values Find next --> Return next record after find if records were sorted Insert --> Add a new record to file without changing file-structure Nearest neighbor of a object in a spatial dataset Examples using Figure 4.1, pp. 88 find(Name = Canada) on Country table returns record about Canada Find next() on Country table returns record about Cuba since Cuba is next value after Canada in sorted order of Name insert(record about Panama) into Country table adds a new record location of record in Country file depends on file-structure nearest neighbor Argentina in country table is Brazil

4.1.4 File Structure--Common File Structures
What is a file structure? A method of organizing records in a file For efficient implementation of common file operations on disks Example: ordered files or unordered files Common file structures Heap or unordered or unstructured Ordered Hashed Clustered Descriptions follow Basic Comparison of Common File Structures Heap file is efficient for inserts and used for logfiles But find, findnext, etc. are very slow Hashed files are efficient for find, insert, delete etc. But findext is very slow Orderd file oranization are very fast for findnext and pretty competent for find, insert, etc.

4.1.4 File Structures: Heap(堆 )
Records are in no particular order (Example: Figure 4.1) insert can simple add record to the last sector find, find next, nearest neighbor scan the entire files RIVER表(无序文件),根据给定的关键码(如name)查找一条记录需要扫描文件中的记录。在最坏情况下，文件的所有记录都要被检查，所有存储该文件数据的磁盘页面都要被访问。平均来说，需要检索一半的磁盘页面。无序文件的主要优点是在进行插入操作时可以很容易地在文件末尾插入一条新记录。

File Structure : Hash(散列)
Components of a Hash file structure (Fig. 4.2) A set of buckets (sectors) Hash function : key value --> bucket Hash directory: bucket --> sector Operations find, insert, delete are fast compute hash function lookup directly fetch relevant sector Find next, nearest neighbor are slow no order among records 散列函数将事先选择一个主码域(如city．name)的值映射到一个散列单元中，它采用非常简单的计算完成这项工作。图4-2给出了一个散列文件，它有四个散列单元，每个单元存储在一个独立的磁盘页面中。对于小于等于6个字符的名字散列函数返回l；对于7至8个字符的名字散列函数返回2；对于9至10个字符散列函数返回3；对于11个以上字符的名字散列函数返回4。城市的名字通过散列函数分别映射到散列单元1到4中。散列函数的可取之处在于它能够把数量大致相同的记录放入每个散列单元中。散列文件组织方式并不适合范围查询，例如，查找所有名字以字母“B”开头的城市的详细信息城市的名字通过散列函数分别映射到散列单元1到4中。可取之处在于它能够把数量大致相同的记录放入每个散列单元中

4.1.4 File Structures: Ordered
Records are sorted by a selected field (Example Fig. 4.3 below) 有序文件根据给定的主码域对记录进行组织 find next can simply pick up physically next record find, insert, delete may use binary search(折半法 ), it is very efficient nearest neighbor processed as a range query (see pp. 95 for details) 有序文件组织方式不能直接应用在空间领域，例如，除非对多维空间中的点定义一个全序，否则无法对城市的位置排序。有序文件组织方式还可以根据对空间数据集的文件组织方式而概括成空间聚类。 Figure 4.3

4.1.5 Spatial File Structures: Clustering
Motivation: Ordered files are not natural for spatial data Clustering records in sector by space filling curve is an alternative In general, clustering groups records accessed by common queries into common disk sectors to reduce I/O costs for selected queries 按其最抽象的形式来说，聚类的目的就是降低响应常见的大查询的寻道时间(ts)和等待时间(t1)。对于空间数据库来说，这意味着在二级存储中，空间上相邻的和查询上有关联性的对象在物理上应当存储在一起。

SDBMS支持三种聚类内部聚类(internal clustering)：为了加快对单个对象的访问，一个对象的全部表示都存放在同一个磁盘页面中，这里假设它小于页面的空闲空间；否则，这个对象就要存储在多个物理上连续的页面中。在这种情况下，对象占用的页面数至多比存储该对象所需的最小页面数大一。本地聚类(1ocal clustering)：为了加快对多个对象访问的速度，一组空间对象(或者近似)被分组到同一页面。这种分组可以依照数据空间中对象的位置(或近似)来施行。全局聚类(global clustering)：与本地聚类相反，一组空间邻接的对象并不存储在一个而是多个物理上邻接的页面中，这些页面可以由一条单独的读命令访问。

空间聚类技术 Clustering using Space filling curves
空间聚类技术的设计比传统聚类技术要复杂得多，因为在空间数据所处的多维空间中根本没有天然的顺序。事实上，存储磁盘从逻辑上说是一维的设备，需要一个从高维空间向一维空间的映射，该映射是距离不变(distance-preserving)的，这样空间上邻近的元素映射为直线上接近的点，而且一一对应，即空间上不会有两个点映射到直线的同一个点上。为达到这一目标，提出了许多映射方法(它们都不能完全理想地满足这一目标)。 Clustering using Space filling curves Z-curve Hilbert-curve Details on following 3 slides

A space filling curve,利用一个线性顺序来填充空间，可以获得从一端到另一端的曲线
用置换规则如何统一生成 Z-Curve What is a Z-curve? A space filling curve,利用一个线性顺序来填充空间，可以获得从一端到另一端的曲线 Generated from interleaving bits （交叉存取） x, y coordinate See Fig. 4.6 Alternative generation method see Fig. 4.5 Connecting points by z-order see Fig. 4.4 looks like Ns or Zs Implementing file operations similar to ordered files Fig 4.6 Fig 4.4

Example of Z-values Figure 4.7
Left part shows a map with spatial object A, B, C Right part and Left bottom part Z-values within A, B and C Note C gets z-values of 2 and 8, which are not close Exercise: Compute z-values for B. Fig 4.7

Hilbert Curve Fig 4.5 A space filling curve More complex to generate
Example: Fig. 4.5 More complex to generate due to rotations See details on pp Illustration on next slide! Implementing file operations similar to ordered files

Hilberlt曲线转换 1)读入x和y坐标的n比特二进制表示。 2)隔行扫描二进制比特到一个字符串。

Calculating Hilbert Values (Optional Topic)
1000 “00”等于0，“01”等于l； “10”等于3；“11”等于2 对于数组中每个数字j，如果　·j=0把后面数组中出现的所有l变成3，并把所有出现的3变成1。　·j=3把后面数组中出现的所有0变成2，并把所有出现的2变成0。 32-> 1110=14 Fig 4.8

Z曲线和Hilbert曲线的聚类效率图4-9 聚类的图示：a) Z曲线的两个聚类。b)Hilbert曲线的两个聚类 Hilbert曲线的方法要比Z曲线好一些，因为它没有斜线。不过，Hilbert算法和精确人口点及出口点的计算量都要比Z曲线复杂。

区域处理一个区域通常划分成一个或多个片(块)，每个块可由一个z值描述。块(block)是原始图像由四叉树分割的一个或多个部分的正方形区域。四叉树分割是递归地把空间分成四个相等的部分 ZB=11** 12 13 14 15 10 11 1100 1101 1110 1111

把一幅2n×2n的图像压缩成线性四叉树的过程
1°、按Morton码把图象读入一维数组。 2°、相邻的四个象元比较，一致的合并，只记录第一个象元的Morton码。循环比较所形成的大块，相同的再合并，直到不能合并为止。 3°、进一步用游程长度编码压缩。压缩时只记录第一个象元的Morton码。 A 0 A 1 A 4 A 5 A 2 B 3 B 6 B 7 A 8 A 9 B 12 B 13 A 10 A 11 B 14 B 15 右图的压缩处理过程为： 1°、按Morton码读入一维数组。 Morton码：象元值： A A A B A B B B A A A A B B B B 2°、四相邻象元合并，只记录第一个象元的Morton码。 A A A B A A B B A B 3°、由于不能进一步合并，则用游程长度编码压缩。 A B A B A B A 0 A 1 A 4 A 5 A 2 B 3 B 6 B 7 A 8 A 9 B 12 B 13 A 10 A 11 B 14 B 15 解码时，根据Morton码就可知道象元在图像中的位置(左上角)，本Morton码和下一个Morton码之差即为象元个数。知道了象元的个数和象元的位置就可恢复出图像了。

每个对象(和范围查询)都可以用该块的z值唯一地表示(或者用它的块的最大和最小z值表示)。
每个这样的z值都以(z值、对象id、其他属性……)的形式作为记录的主码，并且可以插入到一个主码文件结构中，例如B+树。可以用相同方法处理在同一地址空间上的其他对象；它们的z值将插入到同一棵B+树。这些块可以用于高效处理范围查询。

6．访问方法的算法使用Z序方法可以处理列出的所有查询点查询：使用二分法在排序文件中查找给出的z值，或在基于z值的B树索引上使用B树搜索。
·范围查询：查询形状可以翻译成一组z值，就像它是一个数据区域一样。通常，选择它的近似表示，尽量平衡z值的数量和近似中出现额外区域的数量。然后搜索数据区域的z值以匹配z值。匹配的有效性：用z1和z2分别代表两个z值，其中z1是较短的一个，并未失去一般性；对于相应的区域(比如块)r1和r2，只有两种可能：1)如果z1是z2的前缀(例如，z1=l***，z2=11**或z1=*l**，z2=11**)，则r1完全包含r2；2)两个区域不相交(例如，z1=*0**，z2=11**)。 z2 z1

空间连接：空间连接：空间连接算法是范围查询算法的一般化。
设S是空间对象集(例如lakes)，R是另一个对象集(例如railway-1ine segment)。空间连接“找出所有穿过湖的铁路”将按如下方法进行处理：将集合S的元素转化成z值并排序；将集合R的元素也转化成排序后的z值列表，合并两个z值表。其中，在确定交叠时要小心处理代表“任意”字符的“*”。

采用Z序的B树采用Z序的B树会有一些缺陷。一个问题和连接操作有关：如果网格的结构不一致，索引的分区就不能直接连接，为了易于处理空间连接就不得不重新计算索引。作为空间填充曲线，Z序的另一个缺点是它有很长的对角线跨越，连接这些跨越的连续z值在X-Y坐标系中距离很大。z序的空间聚类可以通过使用Hilbert曲线来改进。

LO1: Understand concept of a physical data model LO2 : Learn how to efficiently use storage devices LO3: Learn how to structure data files LO4: Learn how to use auxiliary data-structures Concept of index Spatial indices, e.g. Grids / Grid-file and R-tree families Focus on concepts not procedures! LO5: Learn about technology trends in physical data model Mapping Sections to learning objectives LO2, LO LO LO

What is an index? Fig 4.10 索引记录被排序数据文件本身可以是不按关键码排序
Concept of an index auxiliary file to search a data file 索引文件是用来提高数据文件查询效率的辅助文件 Example: Fig. 4.10 index records have key value address of relevant data sector see arrows in Fig. 4.10 Index records are ordered find, findnext, insert are fast Note assumption of total order on values of indexed attributes Name—30 bytes, address –8bytes 索引记录被排序 Fig 4.10 数据文件本身可以是不按关键码排序

索引--加快检索速度由于索引文件中的记录是有序的，那么即使数据文件无序，也可以使用折半查找来搜索索引文件。此外，索引文件日益缩小(根据磁盘页面数量)，因此也加快了搜索速度。一旦找到一条适当的索引记录，就可以通过索引记录所指向的数据文件磁盘页面，获得数据记录。

Classifying indexes Classification criteria Data-file-structure
Key data type others Secondary index Heap data file 1 index record per data record Example Fig. 4.10 Primary index Data file ordered by indexed attribute 1 index record per data sector Example: Fig. 4.11 Q? A table can have at most one primary index. Why? Fig 4.11 主索引：如果数据文件的记录是按主码排序的，那么索引就只需要保存数据文件的每个磁盘页面第一个主码域值

桶空间索引结构用一组桶(bucket)(通常对应二级存储的页面)来组织对象
每个桶有一个关联的桶区域，即包含了存储在桶中全部对象的一部分空间。桶区域通常是矩形的。对于点数据结构来说，这些区域通常是不相交的，它们将空间分区使每个点正好属于一个桶。对于一些矩形数据结构来说，桶区域可能是交叠的。

Attribute data types and Indices
提供空间索引主要有两种方法 1)在系统中加入专门的外部空间数据结构，为空间属性提供如同B树之于线性属性的功能。 2)使用空间填充曲线(如Z序、Hilbert曲线)将空间对象映射到一维空间，以便空间对象存储在标准的一维索引(例如B树)中。 Index file structure depends on data type of indexed attribute Attributes with total order Example, numbers, points ordered by space filling curves B-tree is a popular index organization See Figure 1.12 (pp. 18) and section 1.6.4 Spatial objects (e.g. polygons) Spatial organization are more efficient Hundreds of organizations are proposed in literature Two main families are Grid Files and R-trees

Ideas behind Grid Files
Basic idea- Divide space into cells by a grid 固定网格方法将嵌入的n维空间划分成同等大小的桶 Example: Fig. 4.12, Example:latitude-longitude, ESRI Arc/SDE Store data in each cell in distinct disk sector Efficient for find, insert, nearest neighbor But may have wastage of disk storage space non-uniform data distribution over space Fig 4.12 结构对于处理静态一致分布式数据(例如卫星影像)很有用。然而，固定网格结构是固定的，它的目录稀疏而巨大。

Refinement of basic idea into Grid Files
1. Use non-uniform grids (Fig. 4.14) Linear scale store row and column boundaries 2. Allow sharing of disk sectors across grid cells 定位存储X=60、Y=80的记录／对象的桶？共享桶为了获得更好的灵活性和性能，[Nievergelt et a1．，1984]引人了网格文件(grid file)。网格文件对于精确匹配和部分匹配检索都有相对较好的I／O性能。网格文件采用多属性索引技术来分割嵌入的n维空间。从局部上说，它用一个网格方法，即在分割方向上划分所有区域。使用网格文件的目标就是为了实现二次磁盘访问原则：一次访问获图4-14　线性比例的网格文件

Grid Files Fig 4.13 Grid File component
Linear scale - row/column boundaries Grid directory: cell --> disk sector address data sectors on disk Operation implementation Scales and grid directory in main memory Steps for find, nearest neighbor Search linear scales Identify selected grid directory cells Retrieve selected disk sectors Performance overview Efficient in terms of I/O costs Needs large main memory for grid directory Fig 4.13

4.2.2 R-Tree Family Basic Idea
Use a hierarchical collection of rectangles to organize spatial data Generalizes B-tree to spatial data sets R树：B树在k维上的自然扩展 Classifying members of R-tree family Handling of large spatial objects Allow rectangles to overlap - R-tree Duplicate objects but keep interior node rectangles disjoint - R+tree Selection of rectangles for interior nodes greedy procedures - R-tree, R+tree procedure to minimize coverage, overlap - packed R-tree Other criteria exist Scope of our discussion Basics of R-tree and R+tree Focus on concepts not procedures!

Spatial Objects with R-Tree
树的每个结点对应一个磁盘页面。一个叶结点包括一组项，其格式为(I，元组标识符)，其中I为MBR，元组标识符是数据库中存储对应于MBR的对象的元组唯一标识符。非叶结点由多个格式为(I，子结点指针)的项组成，其中I是子结点指针(child-pointer)指向的更低层结点项中所有矩形的MBR。树中每个结点最多有M个条目，最少有m个(其中m≤M／2)，除非它是根。根结点至少有两个子结点，除非它是叶。 Fig 4.15 Fig 4.16

Properties of R-trees Fig 4.15 Fig 4.16
Balanced（树中每个结点最多有M个条目，最少有m个(其中m≤M／2) Nodes are rectangle child’s rectangle within parent’s possible overlap among rectangles! Other properties : 1) 每个叶结点包含m至M条索引记录(其中m≤M／2)，除非它是根结点。 2)一个叶结点上的每条索引记录了(I，元组标识符)，I是最小外包矩形，在空间上包含了所指元组表达的k维数据对象。 3)每个非叶结点都有m至M个子结点，除非它是根结点。 4)对于非叶结点中的每个条目(I，子结点指针)，I是在空间上包含其子结点中矩形的最小外包矩形。 5)根结点至少有两个子结点，除非它是叶结点。 6)所有叶结点出现在同一层。 7)所有MBR的边与一个全局坐标系的轴平行 . Fig 4.15 Fig 4.16

Spatial Objects with R-Tree
Implementation of find operation Search root to identify relevant children Search selected children recursively 点查询和范围查询在R树中可以用自顶向下递归的方法进行处理。查询点(或区域)首先同根结点中每个项(I，子结点指针)进行比较。如果查询点在I中(或查询区域与其交叠)，则查找算法就递归地应用在子结点指针指向的R树结点上。该过程直至R树的叶结点为止。使用叶结点中选出的项来检索与选中空间主码关联的记录。 Fig 4.15 Ex.: find record for rectangle 5 Root search identifies child x Search of x identifies children b and c Search of b does not find object 5 Search of c find object 5 Fig 4.16 由于R树是一棵平衡树，在与插入对象对应的叶结点已满的情况下，插入操作可能会导致结点向根部分裂。

R+tree Fig 4.18 Fig 4.17 Properties of R+trees Balanced
Interior nodes are rectangle child’s rectangle within parent’s disjoint rectangles Leaf nodes - MOBR of polygons or lines leaf’s rectangle overlaps with parent’s Data objects may be duplicated across leafs Other properties in section 4.2.2 find operation - same as R-tree But only one child is followed down Ex.: find record for rectangle 5 Root search identifies child x Search of x identifies children b and c Search either b or c to find object 5 Fig 4.17

LO1: Understand concept of a physical data model LO2 : Learn how to efficiently use storage devices LO3: Learn how to structure data files LO4: Learn how to use auxiliary data-structures LO5: Learn about technology trends in physical data model Mapping Sections to learning objectives LO2, LO LO LO

4.3 Trends New developments in physical model
Use of intra-object indexes Support for multiple Concurrent operations Index to support spatial join operations Motivation: large objects (e.g. polygon boundary of USA has 1000s of edges Algorithms for OGIS operations (e.g. touch, crosses) often need to check only a few edges of the polygon Relevant edges can be identified by spatial index on edges Example: Fig. 4.19, pp. 105, section 4.3.1 Uniqueness intra-object index organizes components within a large spatial object traditional index organizes a collection of spatial objects

4.3.2 Trends - Concurrency support
Why support Concurrent operations? SDBMS is shared among many users and applications Simultaneous requests from multiple users on a spatial table serial processing of request is not acceptable for performance concurrent updates and find can provide incorrect results Concurrency control idea for R-tree index R-link tree: Add links to chain nodes at each level Use links to ensure correct answer from find operations Use locks on nodes to coordinate conflicting updates Details in section and Fig. 4.20, pp. 107

4.3.3 Trends: Join Index Fig 4.21 Ideas Details in section 4.3.3
Spatial join is a common operation. Expensive to compute using traditional indexes Spatial join index pre-computes and stores id-pairs of matched rows across tables Example in Fig. 4.21 Speeds up computation of spatial join Details in section 4.3.3 Fig 4.21

Summary Physical DM efficiently implements logical DM on computer hardware Physical DM has file-structure, indexes Classical methods were designed for data with total ordering fall short in handling spatial data because spatial data is multi-dimensional Two approaches to support spatial data and queries Reuse classical method Use Space-Filling curves to impose a total order on multi-dimensional data Use new methods R-trees, Grid files

§空间索引（Spatial Index）与空间检索
空间索引(也称为空间访问方法（Spatial Access Method SAM）)，就是指依据空间对象的位置和形状或空间对象之间的某种空间关系按一定的顺序排列的一种数据结构，其中包含空间对象的概要信息，如对象的标识、外接矩形及指向空间对象实体的指针。作为一种辅助性的空间数据结构，空间索引介于空间操作算法和空间对象之间，它通过筛选作用，大量与特定空间操作无关的空间对象被排除，从而提高空间操作的速度和效率。空间索引的性能的优劣直接影响空间数据库和地理信息系统的整体性能，它是空间数据库和地理信息系统的一项关键技术。常见大空间索引一般是自顶向下、逐级划分空间的各种数据结构空间索引，比较有代表性的包括BSP树、K-D-B树、R树、R+树和CELL树、四叉树等。此外，结构较为简单的索引文件、格网型空间索引有着广泛的应用。

一、索引文件除记录本身的主文件外，还利用索引法列出一个键值K与其对应记录Rk的磁盘地址A(Rk)的索引表，即索引是由关键字和指针组成的索引项构成。

1、索引非顺序文件定义索引表中顺序列出所有可能的键值（稠密索引），利用二分查找法查找所需键值，得到所需记录地址。该方法存取快，且无需记录顺序排列。建立方法记录按输入的顺序放入数据区，同时软件在索引区建立索引表，待全部数据输完后，软件自动将索引表排序。索引排序

1、索引非顺序文件 * 便于增删记录

维护删除增加修改删除索引项，数据区保留，重新组织文件时消除之。删除数据，索引保留，重新组织文件时消除之。
数据放在文件末尾，增加索引项，并排序。修改查找相应位置，修改记录内容。（修改前后内容大小不一致时，涉及到删除增加操作）

排序费时；文件大时索引速度较慢。解决方法——建立多级索引
排序费时；文件大时索引速度较慢。解决方法——建立多级索引

2、索引顺序文件定义是一种按照逻辑键值排序的索引文件，是用嵌入索引的手段把顺序文件予以扩充，以加速查找，记录的物理顺序与索引中键值的顺序是一致的。（采用稀疏索引）建立方法数据按顺序分块存放（块间相临），记录每块的最后记录键值及块的首地址形成索引表。

索引顺序文件的索引机制一级索引两级索引

*不足：增删较麻烦，多次增删后，文件的空间利用率、存储效率均降低，需要重新组织文件。
2、索引顺序文件维护删除物理删除逻辑删除增加避免移动过多文件，将之暂放于溢出取。（重新组织文件时归位）修改查找相应位置，修改记录内容。 * 索引紧凑，查找速度快。 *不足：增删较麻烦，多次增删后，文件的空间利用率、存储效率均降低，需要重新组织文件。

二、格网索引（Grid Index) 思路：
是将研究区域用横竖线条划分大小相等或不等的格网，记录每一个格网所包含的空间实体。当用户进行空间查询时，首先计算出用户查询对象所在格网，然后再在该网格中快速查询所选空间实体，这样一来就大大地加速了空间索引的查询速度。步骤：划分行列（M X N）；计算网格大小及每个格网的矩形范围；开辟目标空间（记录目标穿过的网格）和格网空间（记录格网内的目标）；注册点、线、面、注记等目标，并记录之；提取窗口所覆盖的目标关键字（采用数据位方法，以降低排序时间，及避免数据的绘制顺序等）；提取目标所涉及的网格。

格网索引实例

三、其他空间索引 1、BSP树 2、KDB树 KDB树是BSP树向多维空间的一种发展。

3、四叉树索引线性四叉树空间索引分层四叉树空间索引

4、R树

5、R+树

6、CELL树

四、空间检索基于属性特征查询基于空间关系和属性特征的查询关系数据库 + 条件查询空间扩展SQL查询语言
SELECT…FROM…WHERE

四、空间检索空间实体间的关系检索面与面线与线点与点线与点面与线面与点

Chapter4: Spatial Storage and Indexing

Similar presentations

Presentation on theme: "Chapter4: Spatial Storage and Indexing"— Presentation transcript:

Similar presentations

About project

反馈

请登录

Auth with social network:

Chapter4: Spatial Storage and Indexing

Similar presentations

Presentation on theme: "Chapter4: Spatial Storage and Indexing"— Presentation transcript:

Similar presentations

About project

反馈