Software in Silicon: What It Does and Why

Javed A Mohammed-Oracle Oct 22 2015 — edited Oct 21 2016

by Renato Ribeiro

Software features incorporated into Oracle's SPARC S7 and M7 processors provide increased security and higher performance for databases and software applications.

The Software in Silicon design of the SPARC M7 processor, and of the recently announced SPARC S7 processor, implements memory access validation directly in the processor so that you can protect application data that resides in memory. These processors also include on-chip Data Analytics Accelerator (DAX) engines that are specifically designed to accelerate analytic functions. The DAX engines make in-memory databases and applications run much faster, and they significantly increase usable memory capacity by allowing compressed databases to be stored in memory without a performance penalty.

The following Software in Silicon technologies are implemented in the SPARC S7 and M7 processors:

Note: Security in Silicon encompasses both Silicon Secured Memory and cryptographic instruction acceleration, whereas SQL in Silicon includes In-Memory Query Acceleration and In-Line Decompression.

  • Silicon Secured Memory is the first-ever end-to-end implementation of memory-access validation done in hardware. It is designed to help prevent security bugs, such as Heartbleed, from putting systems at risk by performing real-time monitoring of memory requests made by software processes. It stops unauthorized access to memory whether that access is due to a programming error or a malicious attempt to exploit buffer overruns. It also helps accelerate code development and helps ensure software quality, reliability, and security.
  • The SPARC M7 processor includes 32 cryptographic instruction accelerators while the SPARC S7 processor includes 8, one per core in both cases. This enables the system to deliver wire-speed encryption for secure data center operation without a performance penalty. The accelerators support industry-leading, industry-standard ciphers and hashes. On-chip cryptographic acceleration was first introduced in SPARC processors in 2005, and the current implementation, with some enhancements, has been offered for more than four years.
  • In-Memory Query Acceleration increases the performance of in-memory database queries by operating on data that is streamed directly from memory via extremely high-bandwidth interfaces—with speeds up to 160 GB/sec—resulting in large performance gains. In-Memory Query Acceleration is implemented in the SPARC S7 and M7 processors through multiple accelerator engines (described in more detail later).
  • In-Line Decompression is a feature that significantly increases usable memory capacity. The SPARC M7 processor runs data decompression with performance that is equivalent to 64 CPU cores (24 CPU cores for the SPARC S7 processor). This capability allows compressed databases to be stored in memory while being accessed and manipulated at full speed.

This article focuses on Silicon Secured Memory and SQL in Silicon.

The Silicon Secured Memory feature can be always on for improved reliability and security. Oracle Database 12c supports Silicon Secured Memory. Existing applications can be enabled for Silicon Secured Memory without recompiling, by linking with the appropriate Oracle Solaris libraries and then being verified in a test environment. In addition, open APIs for Silicon Secured Memory are available to software developers.

The In-Memory Query Acceleration and In-Line Decompression capabilities, jointly referred to as SQL in Silicon, are combined to maximize the use of memory capacity, bandwidth, and CPU cores, which has a big impact on performance. The SQL in Silicon capabilities are performed by the Data Analytics Accelerator (DAX) engines that are specifically designed into the SPARC M7 chip to accelerate analytic queries.

DAX technology was initially used with the Oracle Database In-Memory option of Oracle Database 12c. Oracle has released open APIs for DAX, allowing application developers to leverage DAX technology to accelerate a broad spectrum of analytics software.

The following sections describe the Software in Silicon features of the SPARC S7 and M7 processors in more detail and provide examples of how they can benefit your applications.

Detecting Memory Reference Errors and Attacks

Data integrity is a primary area of interest in software development, especially when languages such as C/C++ are used. Speeding up time to market and building robust software applications are also top objectives during software development.

The primary function of the Silicon Secured Memory feature is to detect and report memory reference errors. In multithreaded applications, many threads operate on large, shared memory segments. While managing the allocation and release of shared memory segments, multithreaded applications can run into various coding problems that are time-consuming to diagnose. Two of these problems are

  • Silent data corruption
  • Buffer overruns

Figure 1 illustrates the silent data corruption problem using a simplified example. There is a memory area that is shared by many processes, and there are two application threads running: Thread A and Thread B. The threads are color coded blue and green, respectively, to demarcate the memory areas that they should be accessing. However, sometimes while the application is running, Thread A starts writing to Thread B's green area by mistake. This can happen if Thread A allocates memory and subsequently frees it, but holds on to a pointer to it. Later, when Thread B takes ownership of the memory space, if Thread A uses its obsolete pointer, Thread A will act on memory that is now owned by Thread B. If this malfunction is not caught immediately, it will be detected only when Thread B reads that memory location. In such a case, Thread B has data that has been silently corrupted by Thread A. This corruption manifests itself as a software bug with serious consequences, and such silent data corruptions are extremely difficult to diagnose.


Figure 1. Silent data corruption caused by Thread A acting on Thread B's memory area

Figure 2 illustrates the problem of buffer overruns. In this case, an application has a memory area allocated to it, but it erroneously starts writing or reading data beyond this allocated area. Sensitive data can be corrupted or leaked into other memory locations, and the application that owns the data is not aware of this. For example, a malicious attack could cause a catastrophic security breach that allows another application to read sensitive information that has been mistakenly made available to it.


Figure 2. Buffer overrun caused by Thread B using memory outside of its allocated area

Example of How Silicon Secured Memory Stops Silent Data Corruption

The Silicon Secured Memory feature enables faster detection of memory reference errors such as silent data corruption. As illustrated in Figure 3, a key in each memory pointer is used to indicate the memory version, or "color." During the process of memory allocation, a corresponding code is written to the memory. When this memory is accessed by any pointer, the key of the pointer attempting the access and the code of the memory being accessed are compared by the hardware. If there is a match, the access is legal; if there is no match, the memory reference error is caught immediately.

Figure 3. Detecting silent data corruption by comparing memory versions

In Figure 3, the memory allocated to Thread A is blue and the memory allocated to Thread B is green. When Thread A attempts to access the green memory area of Thread B (depicted by the dotted blue line), the memory version is compared to the pointer's four-bit pattern. Because the two patterns do not match, the problem is flagged immediately by the processor and relayed to the application, as indicated by the stop sign. This enables the application to take immediate action and drastically reduces the time required for application developers to troubleshoot such memory reference bugs.

Example of How Silicon Secured Memory Stops Buffer Overruns

Similarly, the Silicon Secured Memory feature solves the problem of buffer overruns, as shown in Figure 4. Using the same memory versioning described earlier in this article, if an application attempts to read data from or write data to unallocated memory areas, a flag is raised by the processor, which allows corrective action to be taken right away. This enables faster bug detection as well as improved capability for application developers to develop extremely secure applications.


Figure 4. Detecting buffer overruns using memory versioning
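
To make the comparison concrete, here is a minimal conceptual sketch, written in Java, of the version-matching idea shown in Figures 3 and 4. It is only an illustration of the check the hardware performs on every access; it is not the Oracle Solaris ADI programming interface, and the block size, tag values, and class names are invented for the example.

// Conceptual model only: every 64-byte block of "memory" carries a 4-bit
// version, and every pointer carries the version it was allocated with.
// The hardware compares the two on each access; here the check is done in
// software purely to illustrate the idea.
public class VersionedMemoryModel {
    static final int BLOCK = 64;                              // bytes per versioned block (illustrative)
    final byte[] memory = new byte[4096];
    final byte[] blockVersion = new byte[memory.length / BLOCK];

    // A "pointer" is modeled as an offset plus the 4-bit version it was issued with.
    record TaggedPointer(int offset, byte version) {}

    // Allocation stamps a new version onto the blocks it hands out.
    TaggedPointer allocate(int offset, int length, byte version) {
        for (int b = offset / BLOCK; b <= (offset + length - 1) / BLOCK; b++) {
            blockVersion[b] = version;
        }
        return new TaggedPointer(offset, version);
    }

    // Every access compares the pointer's version with the block's version.
    void write(TaggedPointer p, int index, byte value) {
        int addr = p.offset() + index;
        if (blockVersion[addr / BLOCK] != p.version()) {
            throw new IllegalStateException("version mismatch at address " + addr
                    + ": stale or out-of-bounds pointer");
        }
        memory[addr] = value;
    }

    public static void main(String[] args) {
        VersionedMemoryModel m = new VersionedMemoryModel();
        TaggedPointer a = m.allocate(0, 64, (byte) 0x3);      // Thread A's "blue" block
        TaggedPointer b = m.allocate(64, 64, (byte) 0x9);     // Thread B's "green" block
        m.write(a, 10, (byte) 1);                             // versions match: access is legal
        m.write(a, 70, (byte) 1);                             // lands in B's block: flagged immediately
    }
}

The same mismatch check is what flags the buffer overrun in Figure 4: an access that runs past an allocation lands in a block whose version no longer matches the version carried by the pointer.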

Accelerating Oracle Database In-Memory Queries

With the growing size of information stored in today's databases, it is increasingly critical to access and use the right information. Compressing stored data to reduce the storage footprint is a key business strategy in every IT system design. At the same time, any enterprise performing analytics to shape its future business opportunities needs the ability to access, decompress, and run fast queries on the stored data. Oracle's software engineers incorporate various features into Oracle Database to effectively leverage the hardware resources—such as cores, I/O channels, and memory—of the underlying server. Oracle Database follows a "shared everything architecture" that allows flexibility for parallel execution and high concurrency without overloading the system.

The In-Memory Query Acceleration and In-line Decompression features of the SPARC S7 and M7 processors were designed to work with Oracle Database In-Memory, which enables existing Oracle Database–compatible applications to automatically and transparently take advantage of columnar in-memory processing without additional programming or application changes.

The primary design goal for Oracle Database In-Memory was fast responses for analytic operations. The traditional way of storing and accessing data in a database employs a row format, which is great for online transaction processing (OLTP) workloads that perform frequent inserts and updates and handle report-style queries. However, analytics run best with a columnar database format. With Oracle Database In-Memory, it is possible to have a dual-format architecture that provides a row format for OLTP operations and a column format for analytic operations.
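
As a rough illustration of the difference, consider the two layouts below (a simplified sketch in Java; the record type, field names, and data are invented for the example). An analytic scan over one attribute in the columnar layout walks a single contiguous array, whereas in the row layout every field of every row is pulled along with it.

// Row format: each record stores all of its fields together, which suits
// frequent inserts and updates of individual rows.
record CarSaleRow(String brand, int yearOfSale, double price) {}

// Column format: each attribute is stored contiguously, which suits
// analytic scans that touch only a few columns.
class CarSalesColumnar {
    String[] brand;
    int[] yearOfSale;
    double[] price;

    long countSoldIn(int year) {
        long count = 0;
        for (int y : yearOfSale) {   // one contiguous, cache-friendly pass over a single column
            if (y == year) count++;
        }
        return count;
    }
}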

Oracle Database In-Memory populates the data in the in-memory (IM) column store, which provides various advantages. For example, a set of compression algorithms is automatically run on the data being placed in the IM column store, resulting in storage savings. Then, when a query is run, it can scan and filter the compressed data directly, so the volume of data scanned during the query is far less than when the query has to run on uncompressed data. The IM column store creates in-memory compression units (IMCUs), as shown in Figure 5. The in-memory columnar data is divided into these smaller IMCUs so that parallelization is possible when a query is run on all the data.


Figure 5. In-memory compression units

The SPARC S7 and M7 processors optimize Oracle Database In-Memory through a synergistic architecture that uses two key Software in Silicon features:

  • In-Memory Query Acceleration
  • In-line Decompression

In the SPARC M7 processor, Oracle introduced eight data analytics accelerator (DAX) units, which are hardware units within the processor that are optimized to handle database queries quickly, each with four pipelines, or engines. The SPARC S7 processor includes four DAX units, each also with four pipelines. The accelerator engines can therefore process a total of 32 independent data streams in a SPARC M7 processor (16 in the SPARC S7 processor), freeing the processor cores to do other work. These accelerator engines are in addition to the cores present in both SPARC processors, and they can process query functions such as decompress, scan, filter, and join. Each thread in a SPARC M7 or S7 core has access to all the accelerator engines and can utilize them for various functions. Each accelerator is connected to the on-chip level 3 (L3) cache for very fast communication with the cores.

Example of How In-Memory Query Acceleration Speeds Up Database Queries

Imagine that one afternoon a manager is working on a quarter-end report when, for some reason, her boss asks her to go count the number of cars in the parking lot. The manager could go out to the lot and start counting the cars from start to finish, or she could call upon the help of team members. This relieves the manager from the job of counting parked cars, and she can get back to working on that important report. In a few moments, the team members give the manager the result, which is then happily passed on to the boss.

Similarly, in the SPARC S7 or M7 processors, when a core receives a database query, it can offload it to an accelerator engine. After the query is offloaded to the accelerator engine, the core is free to resume other jobs, such as higher-level SQL functions. The accelerator engine runs the query and gathers the result, which it puts in the L3 cache for fast communication with the core. The core is notified about the completion of the query, and it picks up the result from the L3 cache. This query offload feature provides extremely fast query processing capabilities for the processor while freeing the cores to do other functions.

The other advantage of this query offload is the parallelization that is facilitated by the 32 accelerator engines within each SPARC M7 processor (16 in the SPARC S7 processor). Each core has access to all the accelerator engines and can use them at the same time to run a single query in a completely parallel fashion. This parallelism is done by the processor and does not require the application code or the database application to perform any extra operations. The accelerator engines can take data streams directly from the memory subsystem through the processor's extremely high-bandwidth interfaces—which can reach 160 GB/sec. Therefore, queries can be performed on in-memory data at top speeds determined by the memory interface, rather than being controlled by the cache architecture that connects to the processor cores.

As mentioned above, Oracle Database In-Memory organizes the columnar data in chunks of compressed data called in-memory compression units, so the data can be worked on in parallel rather than as one large block of data.

The following example demonstrates the query optimizations and accelerations that occur when an Oracle Database In-Memory query runs on a SPARC S7 or M7 processor. Suppose a query is run against the in-memory column store to find the total number of cars of the brand 'ABC' sold in the year '2005'.

The first optimization, which is provided by the columnar format, is realized right away. Only two columns need to be accessed: the column called 'Car Brand' and the column called 'Year of Sale.' Unlike with the row format, it is not necessary to access every column present in the row.

The second optimization is that each of the IMCUs has storage indexes that are automatically created and maintained for every column. These storage indexes maintain a minimum and maximum value. In this example, the goal is to find cars sold in the year 2005. If '2005' is not a value in the range maintained by the storage index for the column 'Year of Sale,' then that IMCU need not be accessed at all. This is a significant advantage, because a quick comparison of '2005' with the minimum and maximum value of the storage index prunes down the number of IMCUs that need to be accessed.

The third optimization is specific to the SPARC S7 and M7 processors, which are engineered to take advantage of the in-memory column store. When the query comes to a SPARC S7/M7 processor core, the core has accelerator engines at its disposal. It can work on multiple IMCUs at the same time, because it can assign each IMCU to a different accelerator engine. Instead of the traditional way, where the core thread scans and evaluates one IMCU block at a time, the SPARC S7 and M7 processors can take advantage of the parallelization option provided by the accelerator engines. In a different scenario, when multiple queries are run at the same time on a SPARC S7 or M7 processor core, the core can still assign different queries to different accelerator engines and achieve a significant performance boost because of this parallelism. Each accelerator engine then reads the relevant IMCUs directly from memory, processes the query, and returns the value to the processor core through a cache operation.
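
The sketch below models, in plain Java, how the second and third optimizations combine: per-IMCU minimum/maximum pruning followed by fanning the surviving IMCUs out to parallel workers. The thread pool stands in for the DAX engines purely as an analogy; the class and field names are invented for this example, and the code does not reflect how the hardware or Oracle Database actually performs the offload.

import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model: each IMCU carries min/max values for the 'Year of Sale' column,
// so whole units can be skipped before any data is touched, and the units
// that survive pruning are scanned in parallel by a pool of workers
// (standing in for the accelerator engines).
class ImcuScanModel {
    record Imcu(int minYear, int maxYear, int[] yearOfSale) {}

    static long countSalesInYear(List<Imcu> imcus, int year, int engines)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(engines);
        try {
            List<Future<Long>> partials = imcus.stream()
                .filter(u -> year >= u.minYear() && year <= u.maxYear())   // storage-index pruning
                .map(u -> pool.submit(() -> {
                    long n = 0;                                            // per-IMCU scan and filter
                    for (int y : u.yearOfSale()) if (y == year) n++;
                    return n;
                }))
                .toList();
            long total = 0;
            for (Future<Long> f : partials) total += f.get();              // the "core" collects the partial results
            return total;
        } finally {
            pool.shutdown();
        }
    }
}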

Example of How In-line Decompression Speeds Up Database Queries

Typically, database reads are far more frequent than database writes, and this is especially true with analytics. Most of the time, the data written to a database is stored in a compressed format to conserve storage space. This implies that when a query needs to be run on a data set contained in a database, the data must first be decompressed. Although Oracle Database and other databases available today employ smart techniques to optimize when decompression is done, decompression inherently incurs a performance overhead. The compression ratio is the ratio of the uncompressed data size to the compressed data size: if the uncompressed data is 2 TB and the compression ratio is 4:1, then the stored compressed data occupies 0.5 TB of space.

Going back to the car query example mentioned above, in the first step, 1 GB of IMCU data in compressed form is brought into the processor and evaluated in its compressed form. If half of the entries contain the 'Year of Sale' as '2005,' then 0.5 GB of data is written out. In the second step, this 0.5 GB is read back into the processor and decompressed to produce the result. If the compression ratio is 4:1, then the processor writes out 2 GB of data as the final result. The following summarizes all the substeps:

  • The core reads 1 GB of compressed data in the IMCU.
  • The core scans and filters data in compressed format for the IMCU.
  • If, hypothetically, half of the entries have 'Year of Sale' as '2005,' the core writes out 0.5 GB of compressed data.
  • The core reads this 0.5 GB of compressed data.
  • The core decompresses this 0.5 GB of scanned and filtered data and writes out 2 GB of uncompressed data as the final result.

SPARC processor engineers and Oracle Database developers identified this two-step process as a performance bottleneck and an inefficient use of memory bandwidth, because the intermediate result must be written out and then read back in.

The SPARC S7 and M7 processors have the capability of decompressing data and running the query function in a single step using their accelerator engines, as shown in Figure 6. This provides an immense performance boost, because multiple reads and writes do not need to be done. The sequence is as simple as the following:

  • The core offloads query work to an accelerator engine, which reads 1 GB of compressed data.
  • The accelerator engine decompresses the data on the fly and evaluates the query in a single step without any additional read or write operations.
  • The core writes out 2 GB of uncompressed data as the final result.

The extra reads and writes that would otherwise be needed to evaluate the query are avoided. This shortens overall query execution time and makes more efficient use of memory bandwidth by skipping the unnecessary intermediate read and write steps.
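
Using the figures from this example, the difference in memory traffic can be tallied directly. The short Java sketch below is only a back-of-the-envelope calculation with the hypothetical numbers used above (a 1 GB compressed IMCU, half the entries matching, and a 4:1 compression ratio); it is not a measurement.

// Memory traffic, in GB, for the hypothetical query described in the text:
// two-step processing versus single-step in-line decompression.
public class DecompressionTrafficExample {
    public static void main(String[] args) {
        double compressedImcu = 1.0;        // GB of compressed IMCU data read in
        double filteredCompressed = 0.5;    // GB remaining after scan/filter (half the entries match)
        double compressionRatio = 4.0;      // 4:1, so 0.5 GB expands to 2 GB
        double finalResult = filteredCompressed * compressionRatio;

        // Two-step approach: read 1 GB, write 0.5 GB, read that 0.5 GB back, write 2 GB.
        double twoStepTraffic = compressedImcu + filteredCompressed
                              + filteredCompressed + finalResult;          // 4.0 GB moved

        // In-line decompression: read 1 GB once, write the 2 GB result.
        double inLineTraffic = compressedImcu + finalResult;               // 3.0 GB moved

        System.out.printf("two-step: %.1f GB moved, in-line: %.1f GB moved%n",
                          twoStepTraffic, inLineTraffic);
    }
}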


Figure 6. The data analytics accelerator engines contain query and in-line decompression functions

The ability of the core to offload database queries to the accelerator engines in a highly parallel manner, along with the ability to decompress data sets on the fly, provides a unique performance advantage to the SPARC S7 and M7 processors. An example of the performance improvement is shown in Figure 7, in which the performance of the In-line Decompression capability of the SPARC M7 processor is compared to the performance of the previous-generation SPARC T5 processor.


Figure 7. 10x performance improvement through the in-line decompression process


Revision 1.0, 10/22/2015


Comments

807580
I have to admit that I do not know how to determine if some bytes constitute a legitimate UTF-8 value, perhaps there is something in Character that might help.

However this if statement can't be what you want since as far as I can tell, it can never be true.
if(fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13 && fileContent[i]>300)
What single value will satisfy the first and third conditions?

Edited by: johndjr on Oct 23, 2009 8:26 AM
800308
Sanath,

It is my considered opinion that you're in way over your head. That code has some really basic noob mistakes.

Pop quiz:

1. For what value(s) of x is this statement true (x<0 && x>300)? Putz!

2. How many bytes (a signed 8-bit integer value) exceed 300?

3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole files contents. Why do you imagine that you are special (and therefore this technique will work reliably for you anyways)?

4. How were you planning to determine the size of each character by reading bytes? Whitefella mojo maybe? You might also want to try a nice Séance, but I don't much like your chances there either.

5. Have you ever considered a career in the armed services?

Cheers. Keith.
JoachimSauer
corlettk wrote:
2. How many bytes (a signed 8-bit integer value) exceed 300?
301 bytes exceed 300 bytes!
sanath_k
My challenge is to spot the multi-byte character hidden in this big dat file.
This is because the data entry team asked me to spot out the record and column that has issue out of
lakhs of records they sent inside this file.

Lets have the validation code like this...
   if( (fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13) || fileContent[i]>300) // if invalid
{System.out.println("failure at position: "+i);break;}
< 0 - As I tested, some chars generated -ve values for some codes.
300 - was a try to find out if any characters exceeds actual chars. range.
10 and 13 are for line-feed. any alternative (better code ofcourse) way to catch this black sheep ?
sanath_k
My challenge is to spot the multi-byte character hidden in this big dat file.
This is because the data entry team asked me to spot out the record and column that has issue out of
lakhs of records they sent inside this file.

Lets have the validation code like this...
   if( (fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13) || fileContent[i]>300) // if invalid
{System.out.println("failure at position: "+i);break;}
lessthan 0 - I saw some -ve values when I was testing with other files.
greaterthan 300 - was a try to find out if any characters exceeds actual chars. range.
if 10 and 13 are for line-feed.

with this, I randomly placed chinese, korean characters and program found them.
any alternative (better code ofcourse) way to catch this black sheep ?

Edited by: Sanath_K on Oct 23, 2009 8:06 PM
807580
Sanath_K wrote:
   if( (fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13) || fileContent[i]>300) // if invalid
{System.out.println("failure at position: "+i);break;}
lessthan 0 - I saw some -ve values when I was testing with other files.
greaterthan 300 - was a try to find out if any characters exceeds actual chars. range.
if 10 and 13 are for line-feed.

with this, I randomly placed chinese, korean characters and program found them.
any alternative (better code ofcourse) way to catch this black sheep ?
A less obfuscated way of doing that would be
   if( (fileContent[i]&0x80)!=0 ) // if not ASCII-7
{System.out.println("failure at position: "+i);break;}
807580
corlettk wrote:
Sanath,

It is my considered opinion that you're in way over your head. That code has some really basic noob mistakes.

Pop quiz:

1. For what value(s) of x is this statement true (x<0 && x>300)? Putz!

2. How many bytes (a signed 8-bit integer value) exceed 300?

3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole files contents. Why do you imagine that you are special (and therefore this technique will work reliably for you anyways)?

4. How were you planning to determine the size of each character by reading bytes? Whitefella mojo maybe? You might also want to try a nice Séance, but I don't much like your chances there either.

5. Have you ever considered a career in the armed services?
6. How much data do you think you've read when you do this:
fin.read(fileContent);
You might have read as little as one byte, meaning you're skimming over all but one byte of the file.
sanath_k
from right-click, file, properties, I found size: 12512196 bytes
same is the response from file.length, the byte array size before the for loop, and finally the value I am printing to verify, i.e. totalcharacters.
from this, I felt it is ok to go ahead with checking each byte value as the aim is to locate the first special character.
807580
Sanath_K wrote:
from right-click, file, properties, I found size: 12512196 bytes
same is the response from file.length, the byte array size before the for loop, and finally the value I am printing to verify, i.e. totalcharacters.
from this, I felt it is ok to go ahead with checking each byte value as the aim is to locate the first special character.
If this is a disposable program, fine. But you're doing it wrong. And you still have serious issues with how you actually read in data. Namely, you look through every byte of fileContent but it's more than likely that only the first few bytes actually contain data from your file.
807580
If you want the entire contents of the file in a byte array, here's how you can do it:
FileInputStream fin = new FileInputStream(filename); // filename: path to your .dat file
ByteArrayOutputStream baos = new ByteArrayOutputStream();
int len;
byte[] buf = new byte[1024];
while ( (len = fin.read(buf)) != -1 ) {
   baos.write(buf, 0, len);
}

byte[] fileContents = baos.toByteArray();
But you're probably fine looking at it chunk-by-chunk.

In fact, if you're only interested in doing it byte-by-byte, just do this:
BufferedInputStream bin = new BufferedInputStream(fin);
for (int b = -1; (b = bin.read()) != -1; ) {
  //deal with this byte
}
Edited by: endasil on 23-Oct-2009 11:36 AM
sanath_k
lot of helpful comments on the logic...thanks.
question still haunts...
is there any way we can seek help from java code to locate multi-byte characters in a dat file ?
I hope to help the data-entry team to rectify the error and re-process the file.
807580
Sanath_K wrote:
lot of helpful comments on the logic...thanks.
question still haunts...
is there any way we can seek help from java code to locate multi-byte characters in a dat file ?
I hope to help the data-entry team to rectify the error and re-process the file.
This is UTF-8 encoded text? Look at each byte. If the high bit is set, it's a participant in a multi-byte character. [See here|http://en.wikipedia.org/wiki/UTF-8#Description]. tschodt tells you how to check for this in a previous reply.
DrClap
If you want to scan a sequence of bytes from a file and identify locations where a subsequence of two or more bytes would represent a UTF-8 character which isn't in the ASCII character set, then the simplest way is to ask Java to do the UTF-8 part:
Reader r = new InputStreamReader(new FileInputStream(file), "UTF-8");
int character = 0;
while ((character = r.read()) >= 0) {
  // here we have a stream of characters decoded using UTF-8
  if (character > 127) {
    // this one isn't ASCII
  }
}
807580
DrClap wrote:
If you want to scan a sequence of bytes from a file and identify locations where a subsequence of two or more bytes would represent a UTF-8 character which isn't in the ASCII character set, then the simplest way is to ask Java to do the UTF-8 part:
Assuming the file contains valid UTF-8.
If the file contains [invalid byte sequences|http://en.wikipedia.org/wiki/Utf-8#Invalid_byte_sequences] you get UTFDataFormatException.
DrClap
tschodt wrote:
DrClap wrote:
If you want to scan a sequence of bytes from a file and identify locations where a subsequence of two or more bytes would represent a UTF-8 character which isn't in the ASCII character set, then the simplest way is to ask Java to do the UTF-8 part:
Assuming the file contains valid UTF-8.
If the file contains [invalid byte sequences|http://en.wikipedia.org/wiki/Utf-8#Invalid_byte_sequences] you get UTFDataFormatException.
That's true, and certainly a possibility. But we don't know whether that's the OP's problem. All we have is some guff about "multi-byte" characters. If I were doing this -- well I wouldn't be doing this because I would get around to asking the right questions -- I would start with that, then if it threw an exception I would change it to count the number of characters read before the exception was thrown.
796440
corlettk wrote:
3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole files contents.
Erm?
807580
jverd wrote:
corlettk wrote:
3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole files contents.
Erm?
Think he was thinking InputStream.available(). I wouldn't use File.length either way, because then you can't use it against pipes, etc.
807580
I don't see one.

But it is nothing more than obvious that you shouldn't declare a byte array whose length is exactly the file's length. That's a lot of fun if the file length exceeds the available JVM heap memory.
796440
BalusC wrote:
I don't see one.

But it is nothing more than obvious that you shouldn't declare a byte array whose length is exactly the file's length. That's a lot of fun if the file length exceeds the available JVM heap memory.
It's not the byte array that's the problem in that case, but rather the fact that you're going to read the whole file. However, if you do decide to read the whole file, and if you know it's a regular file, not a pipe or something, and if you can assume that the size won't change while you're reading it, then declaring a byte[] of exactly the file's length would be a good way to do it.

It's not the way I'd normally read a file, but I wouldn't rule it out.
DrClap
BalusC wrote:
But it is nothing more than obvious that you shouldn't declare a byte array whose length is exactly the file's length. That's a lot of fun if the file length exceeds the available JVM heap memory.
Even more fun if the file length exceeds the maximum value of an integer and it gets truncated to fit in an integer, which you then use as the size of your array.
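For instance, casting a hypothetical 5 GB length down to int wraps silently:

long length = 5000000000L;        // e.g. a 5 GB file
int truncated = (int) length;     // silently wraps to 705032704 -- no error, just the wrong size
byte[] buf = new byte[truncated]; // quietly allocates far less than the file needs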
Sanath_K wrote:
Hello All

While reading data from the DB, our middleware interface gave the following error.
java.sql.SQLException: Fail to convert between UTF8 and UCS2: failUTF8Conv

I understand that this failure is because of a multi-byte character, where 10g driver will fix this bug.
I suggested the integration admin team to replace current 9i driver with 10g one and they are on it.
Any such problem would of course not be caused by a character but rather by a character set.

Although it is possible that an Oracle driver has a bug in it, the Oracle drivers have been handling character set conversions for years.
Naturally if you do not set up the driver correctly then it will cause a problem.
Sanath_K wrote:
My challenge is to spot the multi-byte character hidden in this big dat file.
..
with this, I randomly placed chinese, korean characters and program found them.
any alternative (better code ofcourse) way to catch this black sheep ?
That of course is ridiculous.

Bytes are encoded to represent a character set. If you have a data file with text in it then it must be because the data file has at least one and perhaps more character sets.

Attempting to determine the character set of a file is generally a non-deterministic problem - a computer cannot determine the answer all the time.
It will definitely not be able to do it with a misplaced single character from another character set.

So your problem is actually one of the following
- Determine what the character set is rather than using the wrong one.
- Attempt to scrub the incoming data to remove garbage (because that is what wrong character set characters would be) from the data before attempting to insert it into the database. And provide enough error detection that problems can be dealt with manually. In this case you are NOT attempting to recognize characters from another character set but rather excluding anything that doesn't fit into the set that you are using.
- Make the source of the file start producing a file in a single/correct character set.
800308
JoachimSauer wrote:
corlettk wrote:
2. How many bytes (a signed 8-bit integer value) exceed 300?
301 bytes exceed 300 bytes!
EPIC FAIL ;-)
800308
Sanath_K wrote:
lakhs of records
I'm stealing that word. It's just so... ummm.... "pithy"!
A lakh (English pronunciation: /ˈlæk/ or /ˈlɑːk/; Hindi: लाख, pronounced [ˈlaːkʰ]) (also written lac) is a unit in the Indian numbering system equal to one hundred thousand (100,000; 10^5). It is widely used both in official and other contexts in Bangladesh, India, Maldives, Nepal, Sri Lanka, Myanmar and Pakistan, and is often used in Indian English.
~~ http://en.wikipedia.org/wiki/Lakh
800308
Folks,

Combining the just-read-single-bytes technique suggested by endasil with the bit-twiddling-highorder-test suggested by tschodt I get:

(*NOTE:* The source code is its own test-data. It contains two "extended characters" in the comment, and therefore must be saved in a UTF-8 (or equivalent) encoded file, and compiled using [javac's|http://www.manpagez.com/man/1/javac/] -encoding UTF8 argument.)

ExtendedCharacterDetector.java
package forums;

import java.io.FileInputStream;
import java.io.BufferedInputStream;

// ÊPIC FÃIL!

public class ExtendedCharacterDetector
{
  public static void main(String[] args) {
    String filename = args.length>0 ? args[0] : "ExtendedCharacterDetector.java";
    try {
      BufferedInputStream input = null;
      try {
        input = new BufferedInputStream(new FileInputStream(filename));
        int count = 0;
        for ( int i=0,b=-1; (b=input.read()) != -1; ++i ) {
          if ( (b&0x80) != 0 ) {
            System.out.println("byte "+i+" is "+b);
            count++;
          }
        }
        System.out.println("Number of bytes exceeding "+0x80+" = "+count);
      } finally {
        if(input!=null)input.close();
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
output
byte 94 is 195
byte 95 is 138
byte 101 is 195
byte 102 is 131
Number of bytes exceeding 128 = 4
This still isn't very useful though, is it? Bytes 94 and 95 are "busted"... Goodo! So WTF is byte 94? I suppose you could download a [free HEX editor|http://www.google.com/search?q=free+hex+editor] and use that to spy-out the offending characters... (I use NEO, it works).

So... presuming that you know what encoding the file is supposed to be in... I am still of the humble opinion that your users would be better-served if you read characters from the file, and reported character offsets and values to the user... the "extendedness test" remains logically the same... extended (non 7-bit-ascii) characters have a value exceeding 128 (2^7), and this (AFAIK) is the same no matter which charset the character has been encoded in, because all modern charsets use the same "the basic ascii table" (i.e. code points <=128).

Cheers. Keith.
807580
It occurs to me that the underlying codecs, defined in java.nio.charset or thereabouts, work between ByteBuffer and CharBuffer, so if you put your data file contents into a ByteBuffer (or map the file to a ByteBuffer) and run the decoder directly it should leave the buffer pointers at the point of failure.

Could save you a lot of bit twiddling.
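
Something along these lines, perhaps (untested sketch, just to show the idea; the class name is made up and the buffer handling is kept minimal):

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

public class Utf8FailurePosition {
  public static void main(String[] args) throws Exception {
    FileChannel ch = new FileInputStream(args[0]).getChannel();
    ByteBuffer bytes = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

    CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);

    CharBuffer out = CharBuffer.allocate(8192);
    while (true) {
      CoderResult res = dec.decode(bytes, out, true);
      if (res.isUnderflow()) {              // all input decoded cleanly
        System.out.println("No malformed UTF-8 found");
        break;
      }
      if (res.isOverflow()) {               // output buffer full: discard the chars and keep going
        out.clear();
        continue;
      }
      // Malformed or unmappable input: the input buffer's position is left at the bad bytes.
      System.out.println("Bad sequence of " + res.length()
          + " byte(s) at byte offset " + bytes.position());
      break;
    }
    ch.close();
  }
}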

But I'm not sure how knowing the offset of the problem is going to help you. The likely cause is that you've somehow told the middleware that this text is UTF-8 when it's in some other encoding, and byte offsets won't help you there.

In other words you probably have a configuration problem.
807580
corlettk wrote:
Sanath_K wrote:
lakhs of records
I'm stealing that word. It's just so... ummm.... "pithy"!
A lakh (English pronunciation: /ˈlæk/ or /ˈlɑːk/; Hindi: लाख, pronounced [ˈlaːkʰ]) (also written lac) is a unit in the Indian numbering system equal to one hundred thousand (100,000; 10^5). It is widely used both in official and other contexts in Bangladesh, India, Maldives, Nepal, Sri Lanka, Myanmar and Pakistan, and is often used in Indian English.
~~ http://en.wikipedia.org/wiki/Lakh
Laks is Dutch for "extremely negligent".
DrClap
BalusC wrote:
Laks is Dutch for "extremely negligent".
Just like the English word lax only more emphatic.
sanath_k
As the DBA suggested, the file was extracted with the select query, where CONVERT (column_name, 'UTF8', 'WE8ISO8859P1') was used, and this resolved the issue in file processing. It was done for all the description columns since we couldn't spot the exact field.
791266
Sanath_K wrote:
As the DBA suggested, the file was extracted with the select query, where CONVERT (column_name, 'UTF8', 'WE8ISO8859P1') was used, and this resolved the issue in file processing. It was done for all the description columns since we couldn't spot the exact field.
I wonder why considering that you had so many bugs in your original code.
sanath_k
The concern from the quality standpoint is that using the convert function for all columns, instead of finding just the one record's field where the data could be corrected, is a much more time-consuming process.
sanath_k
Thanks for all the valuable responses.