1 2 3 4 Previous Next


58 posts
This post will help you get started using Apache Spark DataFrames with Scala on the MapR Sandbox. The new Spark DataFrames API is designed to make big data processing on tabular data easier. A Spark DataFrame is a distributed collection of data organized into named columns that provides operations to filter, group, or compute aggregates, and can be used with Spark SQL. https://www.mapr.com/blog/using-apache-spark-dataframes-processing-tabular-data
This tutorial will help you get started with Standalone Spark applications on the MapR Sandbox. https://www.mapr.com/products/mapr-sandbox-hadoop/tutorials/spark-tutorial
Recommendation engines help narrow your choices to those that best meet your particular needs.  In this post, we
SQL will become one of the most prolific use cases in the Hadoop ecosystem, according to Forrester Research. Apache Drill is an open source SQL query engine for big data exploration. REST services and clients have emerged as popular technologies on the Internet. Apache HBase is a hugely popular Hadoop NoSQL database. This blog post discusses combining all of these technologies: SQL, Hadoop, Drill, REST with JSON, NoSQL, and HBase, by showing how to use the Drill REST API to query HBase and Hive. It also shares a simple jQuery client that uses the Drill REST API, with JSON as the data exchange, to provide a basic user interface https://www.mapr.com/blog/how-use-sql-hadoop-drill-rest-json-nosql-and-hbase-simple-rest-client#.VKyvdCvF90Y

JPA Performance, Don't Ignore the Database

Database Schema

Good Database schema design is important for performance. One of             the most basic optimizations is to design your tables to take as little             space on the disk as possible , this makes disk reads faster and         uses less memory for query processing.

Data Types

You should use             the smallest data types possible, especially for indexed fields.         The smaller your data types, the more indexes (and data) can fit into a         block of memory, the faster your queries will be.


Database Normalization eliminates redundant data, which usually makes         updates faster since there is less data to change. However a Normalized         schema causes joins for queries, which makes queries slower,         denormalization speeds retrieval. More normalized schemas are better         for applications involving many transactions, less normalized are         better for reporting types of applications.  You should normalize         your schema first, then de-normalize later.  Applications often         need to mix the approaches, for example use a partially normalized         schema, and duplicate, or cache, selected columns from one table in         another table. With JPA O/R mapping you can use the @Embedded         annotation for denormalized columns to specify a persistent field whose         @Embeddable         type can be stored as an intrinsic part of the owning entity and share         the identity of the entity.



Database Normalization and Mapping Inheritance Hiearchies

The Class Inheritance hierarchy shown below will be used as an example         of JPA O/R mapping.

In the Single table per class mapping shown below, all             classes in the hierarchy are mapped to a single table in the database.         This table has a discriminator column (mapped by @DiscriminatorColumn), which         identifies the subclass.  Advantages: This is fast for querying, no joins are         required. Disadvantages:  wastage of space since all inherited fields are in every row, a deep inheritance hierarchy will result in wide tables with many, some empty columns.


In the Joined Subclass mapping shown below, the root of the class hierarchy is         represented by a single table, and each subclass has a separate table         that only contains those fields specific to that subclass. This is         normalized (eliminates redundant data) which is better for storage and         updates. However queries cause joins which makes queries slower         especially for deep hierachies, polymorphic queries and relationships.

In the Table per Class mapping (in JPA 2.0, optional in JPA 1.0),          every concrete class is mapped         to a table in the database and all the inherited state is repeated in         that table. This is not         normlalized, inherited data is repeated which wastes space.          Queries for Entities of the same type are fast, however          polymorphic queries cause unions which are slower.


Know what SQL is executed

You need to understand the SQL queries your application makes and         evaluate their performance. Its a good idea to enable SQL logging, then         go through a use case scenario to check the executed SQL.  Logging         is not part of the JPA specification, With EclipseLink you can enable logging of SQL by setting the following property in the persistence.xml file:

                <property name="eclipselink.logging.level" value="FINE"/>

With Hibernate you set the following property in the persistence.xml file:

                <property name="hibernate.show_sql" value="true" />

Basically you want to make your queries access less data, is your         application retrieving more data than it needs, are queries accessing         too many rows or columns? Is the database query analyzing more rows         than it needs? Watch out for the following:
  • queries which execute too often to retrieve needed data
  • retrieving more data than needed
  • queries which are too slow    
    • you can use EXPLAIN to see where you should add indexes

With MySQL you can use the slow query log to see which queries are executing slowly, or you can use the MySQL query analyzer to see slow queries, query execution counts, and results of EXPLAIN statements.

Understanding EXPLAIN

For slow queries, you can precede a SELECT statement with the keyword EXPLAIN          to get information about the query execution plan, which explains how         it would process the SELECT,  including information about how         tables are joined and in which order. This helps find missing indexes         early in the development process.

You should index columns that are frequently used in Query WHERE, GROUP         BY clauses, and columns frequently used in joins, but be aware that         indexes can slow down inserts and updates.

Lazy Loading and JPA

With JPA many-to-one and many-to-many relationships lazy load by         default, meaning they will be loaded when the entity in the         relationship is accessed. Lazy loading is usually good, but if you need         to access all of the "many" objects in a relationship, it will cause         n+1 selects where n is the number of "many" objects.


You can change the relationship to be loaded eagerly as follows :


However you should be careful with eager loading which could cause         SELECT statements that fetch too much data. It can cause a Cartesian         product if you eagerly load entities with several related collections.

If you want to override the LAZY fetch type for specific use cases,         you can use Fetch Join. For example this query would eagerly load the         employee addresses:
In General you should lazily load relationships, test your use case         scenarios, check the SQL log, and use @NameQueries with JOIN FETCH to         eagerly load when needed.


the main goal of partitioning is to reduce the amount of data read for         particular SQL operations so that the overall response time is reduced

                 Vertical Partitioning          splits tables with many columns into multiple tables with fewer         columns, so that only certain columns are included in a particular         dataset, with each partition including all rows.

                 Horizontal Partitioning segments table rows         so that distinct groups of physical row-based datasets are formed. All columns defined to a table are found in         each set of partitions. An         example of horizontal partitioning might be a table that contains         historical data being partitioned by date.

Vertical Partitioning

In the example of vertical partitioning below a table that contains a         number of very wide text or BLOB columns that aren't referenced often         is split into two tables with the most referenced columns in one table         and the seldom-referenced text or BLOB columns in another.

By removing the large data columns from the table, you get a faster         query response time for the more frequently accessed Customer data.         Wide tables can slow down queries, so you should always ensure that all         columns defined to a table are actually needed.
The example below shows the JPA mapping for the tables above. The         Customer data table with the more frequently accessed and smaller data         types  is mapped to the Customer Entity, the CustomerInfo table         with the less frequently accessed and larger data types is mapped to         the CustomerInfo Entity with a lazily loaded one to one relationship to         the Customer.


Horizontal Partitioning

The major forms of horizontal partitioning are by Range, Hash, Hash Key, List, and Composite.

Horizontal partitioning can make queries faster because the query         optimizer knows what partitions contain the data that will satisfy a         particular query and will access only those necessary partitions during         query execution. Horizontal Partitioning works best for large database         Applications that contain a lot of query activity that targets specific         ranges of database tables.

Hibernate Shards

Partitioning data horizontally into "Shards"  is used by google, linkedin, and others to give extreme scalability for very large amounts of data. eBay "shards" data horizontally along its primary access path.

Hibernate Shards is a framework that is designed to encapsulate support for horizontal partitioning into the Hibernate Core.


JPA Level 2 caching         avoids database access for already loaded entities, this make reading         reading frequently accessed unmodified entities faster, however it can         give bad scalability for frequent or concurrently updated entities.

You should configure L2 caching for entities that are:
  • read often
  • modified infrequently
  • Not critical if stale
You should also configure L2 (vendor specific) caching for maxElements, time to expire, refresh...

References and More Information:

JPA             Best Practices presentation
MySQL             for Developers Article
MySQL             for developers presentation
MySQL             for developers screencast
Keeping             a Relational Perspective for Optimizing Java Persistence
Java Persistence with Hibernate
Pro EJB 3: Java             Persistence API
Java             Persistence API 2.0: What's New ?
High Performance MySQL book
Pro MySQL, Chapter 6: Benchmarking and Profiling
EJB 3 in Action
sharding the hibernate way
JPA Caching
Best Practices for Large-Scale Web Sites: Lessons from eBay

Filter Blog

By date: