Software in Silicon: What It Does and Why

Version 9

    by Renato Ribeiro

    Software features incorporated into Oracle's SPARC S7 and M7 processors provide increased security and higher performance for databases and software applications.

     

    The Software in Silicon design of the SPARC M7 processor, and the recently announced SPARC S7 processor, implement memory access validation directly into the processor so that you can protect application data that resides in memory. It also includes on-chip Data Analytics Accelerator (DAX) engines that are specifically designed to accelerate analytic functions. The DAX engines make in-memory databases and applications run much faster, plus they significantly increase usable memory capacity by allowing compressed databases to be stored in memory without a performance penalty.

     

    The following Software in Silicon technologies are implemented in the SPARC S7 and M7 processors:

     

    Note: Security in Silicon encompasses both Silicon Secured Memory and cryptographic instruction acceleration, whereas SQL in Silicon includes In-Memory Query Acceleration and In-Line Decompression.

     

    • Silicon Secured Memory is the first-ever end-to-end implementation of memory-access validation done in hardware. It is designed to help prevent security bugs, such as Heartbleed, from putting systems at risk by performing real-time monitoring of memory requests made by software processes. It stops unauthorized access to memory whether that access is due to a programming error or a malicious attempt to exploit buffer overruns. It also helps accelerate code development and helps ensure software quality, reliability, and security.
    • The SPARC M7 processor includes 32 cryptographic instruction accelerators while the SPARC S7 processor includes 8, one per core in both cases. This enables the system to deliver wire-speed encryption for secure data center operation without a performance penalty. The accelerators support industry-leading, industry-standard ciphers and hashes. The on-chip cryptographic acceleration was first introduced in SPARC processors in 2005 and the current implementation, with some enhancements, has been offered over four years.
    • In-Memory Query Acceleration increases the performance of in-memory database queries by operating on data that is streamed directly from memory via extremely high-bandwidth interfaces—with speeds up to 160 GB/sec—resulting in large performance gains. In-Memory Query Acceleration is implemented in the SPARC S7 and M7 processors through multiple accelerator engines (described in more detail later).
    • In-Line Decompression is a feature that significantly increases usable memory capacity. The SPARC M7 processor runs data decompression with performance that is equivalent to 64 CPU cores (24 CPU cores for the SPARC S7 processor). This capability allows compressed databases to be stored in memory while being accessed and manipulated at full speed.

     

    This article focuses on Silicon Secured Memory and SQL in Silicon.

     

    The Silicon Secured Memory feature can be always on for improved reliability and security. Oracle Database 12c supports Silicon Secured Memory. Existing applications can be enabled with Silicon Secured Memory, without recompiling, by linking with the correct Oracle Solaris libraries and being verified in a test environment. In addition, open APIs for Silicon Secured Memory are available for software developers.

     

    The In-Memory Query Acceleration and In-Line Decompression capabilities, jointly referred to as SQL in Silicon, are combined to maximize the use of memory capacity, bandwidth, and CPU cores, which has a big impact on performance. The SQL in Silicon capabilities are performed by the Data Analytics Accelerator (DAX) engines that are specifically designed into the SPARC M7 chip to accelerate analytic queries.

     

    DAX technology was initially used with the Oracle Database In-Memory option of Oracle Database 12c. Oracle has released open APIs for DAX allowing application developers to leverage DAX technology to accelerate a broad spectrum of analytics software.

     

    The following sections describe the Software in Silicon features of the SPARC S7 and M7 processors in more detail and provide examples of how they can benefit your applications.

     

    Detecting Memory Reference Errors and Attacks

     

    Data integrity is a primary area of interest in software development, especially when languages such as C/C++ are used. Speeding up time to market and building robust software applications are also top objectives during software development.

     

    The primary function of the Silicon Secured Memory feature is to detect and report memory reference errors. In multithreaded applications, lots of threads work on large, shared memory segments. While managing the allocation and release of shared memory segments, multithreaded applications can run into various coding problems that are time-consuming to diagnose. Two of these problems are

     

    • Silent data corruption
    • Buffer overruns

     

    Figure 1 illustrates the silent data corruption problem using a simplified example. There is a memory area that is shared by many processes, and there are two application threads running: Thread A and Thread B. The threads are color coded blue and green, respectively, to demarcate the memory areas that they should be accessing. However, sometimes while the application is running, Thread A starts writing to Thread B's green area by mistake. This can happen if Thread A allocates memory and subsequently frees it, but holds on to a pointer to it. Later, when Thread B later takes ownership of the memory space, if Thread A uses its obsolete pointer, Thread A will act on memory that is now owned by Thread B. If this malfunction is not caught immediately, it will be detected only when Thread B reads that memory location. In such a case, Thread B has data that has been silently corrupted by Thread A. This corruption manifests itself as a software bug with serious consequences, and such silent data corruptions are extremely difficult to diagnose.

     

    f1.jpg

    Figure 1. Silent data corruption caused by Thread A acting on Thread B's memory area

     

    Figure 2 illustrates the problem of buffer overruns. In this case, an application has a memory area allocated to it, but it erroneously starts writing or reading data beyond this allocated area. Sensitive data can be corrupted or leaked into other memory locations, and the application that owns the data is not aware of this. For example, a malicious attack could cause a catastrophic security breach that allows another application to read sensitive information that has been mistakenly made available to it.

     

    f2.gif

    Figure 2. Buffer overrun caused by Thread B using memory outside of its allocated area

     

    Example of How Silicon Secured Memory Stops Silent Data Corruption

     

    The Silicon Secured Memory feature enables faster detection of memory reference errors such as silent data corruption. As illustrated in Figure 3, a key in each memory pointer is used to indicate the memory version, or "color." During the process of memory allocation, a corresponding code is written to the memory. When this memory is accessed by any pointer, the key of the pointer attempting the access and the code of the memory being accessed are compared by the hardware. If there is a match, the access is legal; if there is no match, the memory reference error is caught immediately.

     

    f3.jpg
    Figure 3. Detecting silent data corruption by comparing memory versions

     

    In Figure 3, the memory allocated to Thread A is blue and the memory allocated to Thread B is green. When Thread A attempts to access the green memory area of Thread B (depicted by the dotted blue line), the memory version is compared to the pointer's four-bit pattern. Because the two patterns do not match, the problem is flagged immediately by the processor and relayed to the application, as indicated by the stop sign. This enables the application to take immediate action and drastically reduces the time required for application developers to troubleshoot such memory reference bugs.

     

    Example of How Silicon Secured Memory Stops Buffer Overruns

     

    Similarly, the Silicon Secured Memory feature solves the problem of buffer overruns, as shown in Figure 4. Using the same memory versioning described earlier in this article, if an application attempts to read data from or write data to unallocated memory areas, a flag is raised by the processor, which allows corrective action to be taken right away. This enables faster bug detection as well as improved capability for application developers to develop extremely secure applications.

     

    f4.jpg

    Figure 4. Detecting buffer overruns using memory versioning

     

    Accelerating Oracle Database In-Memory Queries

     

    With the growing size of information stored in today's databases, it is increasingly critical to access and use the right information. Compressing stored data to reduce the storage footprint is a key business strategy in every IT system design. At the same time, any enterprise performing analytics to shape its future business opportunities needs the ability to access, decompress, and run fast queries on the stored data. Oracle's software engineers incorporate various features into Oracle Database to effectively leverage the hardware resources—such as cores, I/O channels, and memory—of the underlying server. Oracle Database follows a "shared everything architecture" that allows flexibility for parallel execution and high concurrency without overloading the system.

     

    The In-Memory Query Acceleration and In-line Decompression features of the SPARC S7 and M7 processors were designed to work with Oracle Database In-Memory, which enables existing Oracle Database–compatible applications to automatically and transparently take advantage of columnar in-memory processing without additional programming or application changes.

     

    The primary design goal for Oracle Database In-Memory was fast responses for analytic operations. The traditional way of storing and accessing data in a database employs a row format, which is great for online transaction processing (OLTP) workloads that perform frequent inserts and updates and handle report-style queries. However, analytics run best with a columnar database format. With Oracle Database In-Memory, it is possible to have a dual-format architecture that provides a row format for OLTP operations and a column format for analytic operations.

     

    Oracle Database In-Memory populates the data in the in-memory (IM) column store, which provides various advantages. For example, a set of compression algorithms is automatically run on the data being stored in the IM column, resulting in storage savings. Then, when a query is run, it can scan and filter the compressed data. The volume of data scanned during the query is far less than when the query has to run on uncompressed data. The IM column store creates in-memory compression units (IMCUs), as shown in Figure 5. The in-memory columnar data is fragmented into these smaller IMCUs so that parallelization is possible when a query is run on all the data.

     

    f5.jpg

    Figure 5. In-memory compression units

     

    The SPARC S7 and M7 processors optimize Oracle Database In-Memory through a synergistic architecture that uses two key Silicon in Software features:

     

    • In-Memory Query Acceleration
    • In-line Decompression

     

    In the SPARC M7 processor, Oracle introduced eight data analytics accelerators (DAX), which are hardware units within the processor that are optimized to handle database queries quickly, each with four pipelines, or engines. In the case of the SPARC S7 processor, it includes four data analytics accelerators (DAX), also with four pipelines, or engines. The accelerator engines can process a total of 32 independent data streams in a SPARC M7 processor (and 16 in the S7 processor), offloading the processor cores to do other work. These accelerator engines are in addition to the cores that are present in both SPARC processors, and can process query functions such as decompress, scan, filter and join. Each thread in a SPARC M7 or S7 core has access to all the accelerator engines and can utilize them for various functions. Each accelerator is connected to the on-chip level 3 (L3) cache for very fast communication with the cores.

     

    Example of How In-Memory Query Acceleration Speeds Up Database Queries

     

    Imagine that one afternoon a manager is working on a quarter-end report when, for some reason, her boss asks her to go count the number of cars in the parking lot. The manager could go out to the lot and start counting the cars from start to finish, or she could call upon the help of team members. This relieves the manager from the job of counting parked cars, and she can get back to working on that important report. In a few moments, the team members give the manager the result, which is then happily passed on to the boss.

     

    Similarly, in the SPARC S7 or M7 processors, when a core receives a database query, it can offload it to an accelerator engine. After the query is offloaded to the accelerator engine, the core is free to resume other jobs, such as higher-level SQL functions. The accelerator engine runs the query and gathers the result, which it puts in the L3 cache for fast communication with the core. The core is notified about the completion of the query, and it picks up the result from the L3 cache. This query offload feature provides extremely fast query processing capabilities for the processor while freeing the cores to do other functions.

     

    The other advantage of this query offload is the parallelization that is facilitated by the 32 accelerator engines within each SPARC M7 processor (16 in the SPARC S7 processor). Each core has access to all the accelerator engines and can use them at the same time to run a single query in a completely parallel fashion. This parallelism is done by the processor and does not require the application code or the database application to perform any extra operations. The accelerator engines can take data streams directly from the memory subsystem through the processor's extremely high-bandwidth interfaces—which can reach 160 GB/sec. Therefore, queries can be performed on in-memory data at top speeds determined by the memory interface, rather than being controlled by the cache architecture that connects to the processor cores.

     

    As mentioned above, Oracle Database In-Memory organizes the columnar data in chunks of compressed data called in-memory compression units so the data can be worked on in parallel, rather than it being one big set of data.

     

    The following example demonstrates all the query optimizations and accelerations that happen when an Oracle Database In-Memory query is run on a SPARC S7 or M7 processor. Suppose a query is being run to find 'Total Number' of cars of the brand 'ABC' sold in the year '2005' over a set of data using the in-memory column store.

     

    The first optimization, which is provided by the columnar format, is realized right away. Only two columns need to be accessed: the column called 'Car Brand' and the column called 'Year of Sale.' Unlike with the row format, it is not necessary to access every column present in the row.

     

    The second optimization is that each of the IMCUs has storage indexes that are automatically created and maintained for every column. These storage indexes maintain a minimum and maximum value. In this example, the goal is to find cars sold in the year 2005. If '2005' is not a value in the range maintained by the storage index for the column 'Year of Sale,' then that IMCU need not be accessed at all. This is a significant advantage, because a quick comparison of '2005' with the minimum and maximum value of the storage index prunes down the number of IMCUs that need to be accessed.

     

    The third optimization is specific to the SPARC S7 and M7 processors, which are engineered to take advantage of the in-memory column store. When the query comes to a SPARC S7/M7 processor core, the core has accelerator engines at its disposal. It can work on multiple IMCUs at the same time, because it can assign each IMCU to a different accelerator engine. Instead of the traditional way, where the core thread scans and evaluates one IMCU block at a time, the SPARC S7 and M7 processors can take advantage of the parallelization option provided by the accelerator engines. In a different scenario, when multiple queries are run at the same time on a SPARC S7 or M7 processor core, the core can still assign different queries to different accelerator engines and achieve a significant performance boost because of this parallelism. Each accelerator engine then reads the relevant IMCUs directly from memory, processes the query, and returns the value to the processor core through a cache operation.

     

    Example of How In-line Decompression Speeds Up Database Queries

     

    Typically database reads are a lot more frequent than database writes, and this is especially true with analytics. Most of the time, the data written to a database is stored in a compressed format to conserve storage space. This implies that when a query needs to be run on a data set contained in a database, data decompression is required first. Although Oracle Database and other databases available today employ smart techniques to optimize when decompression is done, decompression inherently incurs performance overhead. Compression ratio is the ratio by which the database stores uncompressed data in a compressed format. If the uncompressed data is 2 TB and the compression ratio is 4:1, then the stored compressed data occupies 0.5 TB of space.

     

    Going back to the car query example mentioned above, in the first step, 1 GB of IMCU data in compressed form is brought into the processor and evaluated in its compressed form. If half of the entries contain the 'Year of Sale' as '2005,' then 0.5 GB of data is written out. In Step 2, this 0.5 GB is read back into the processor and decompressed for providing the result. If the compression ratio is 4:1, then the processor is writing out 2 GB of data as the final result. The following summarizes all the substeps:

     

    • The core reads 1 GB of compressed data in the IMCU.
    • The core scans and filters data in compressed format for the IMCU.
    • If, hypothetically, half of the entries have 'Year of Sale' as '2005,' the core writes out 0.5 GB of compressed data.
    • The core reads this 0.5 GB of compressed data.
    • The core decompresses this 0.5 GB of scanned and filtered data and writes out 2 GB of uncompressed data as the final result.

     

    SPARC processor engineers and Oracle Database developers identified this as a performance bottleneck and not an optimal way of using memory bandwidth, because a huge amount of data is being read and written for the two-step process.

     

    The SPARC S7 and M7 processors have the capability of decompressing data and running the query function in a single step using its accelerator engines, as shown in Figure 6. This provides an immense performance boost, because multiple reads and writes do not need to be done. The sequence is as simple as the following:

     

    • The core offloads query work to an accelerator engine, which reads 1 GB of compressed data.
    • The accelerator engine decompresses the data on the fly and evaluates the query in a single step without any additional read or write operations.
    • The core writes out 2 GB of uncompressed data as the final result.

     

    The extra overhead of reads and writes for evaluating the query is circumvented. This speeds up the overall query execution time as well as optimizes the efficient use of memory bandwidth by skipping unnecessary decompression read and write steps.

     

    Figure_6_DAX_engine.jpg

    Figure 6. The data analytics accelerator engines contain query and in-line decompression functions

     

    The ability of the core to offload database queries to the accelerator engines in a highly parallel manner, along with the ability to decompress data sets on the fly, provides a unique performance advantage to the SPARC S7 and M7 processors. An example of the performance improvement is shown in Figure 7, in which the performance of the In-line Decompression capability of the SPARC M7 processor is compared to the performance of the previous-generation SPARC T5 processor.

     

    f7.jpg

    Figure 7. 10x performance improvement through the in-line decompression process

     

    See Also

     

     

     

    Revision 1.0, 10/22/2015

     

    Follow us:
    Blog | Facebook | Twitter | YouTube