Mohan wrote:
I am planning to use the studio in Ubuntu and analyze parallel Java programs and the way threads are used effectively in a multi-core environment. Is that feasible? Is there any reference material that explains how to do this in the studio?
Yes, we profile multi-threaded Java applications very often and I think you will find the profiling data very useful.
For simple profiling you can run:
collect -j on java ...
where "..." is what you usually pass to java to run your application.
If you use a script to start your application, you have to create a copy of this script and insert "collect -j on " before the java invocation.
It is important to use Oracle JDK 1.6 or 1.7, because they support the API that we use to get the user's call stacks.
Here is an example that shows the call tree of the NetBeans IDE:
% er_print -ctree test.1.er
| +- 2.501 (13%) org.netbeans.core.startup.TopThreadGroup.run()
| | +- 2.501 (13%) org.netbeans.core.startup.Main.start(java.lang.String)
| | +- 2.180 (11%) org.netbeans.core.startup.Main.getModuleSystem()
| | | +- 1.510 (8%) org.netbeans.core.startup.ModuleSystem.restore()
| | | | +- 1.510 (8%) org.netbeans.core.startup.ModuleList.trigger(java.util.Set)
| | | | +- 1.500 (8%) org.netbeans.core.startup.ModuleList.installNew(java.util.Set)
| | | | | +- 1.440 (7%) org.netbeans.ModuleManager.enable(java.util.Set)
| | | | | | +- 1.120 (6%) org.netbeans.core.startup.NbInstaller.load(java.util.List)
| | | | | | | +- 0.380 (2%) org.netbeans.core.startup.NbInstaller.loadCode(org.netbeans.Module, boolean)
| | | | | | | +- 0.380 (2%) org.netbeans.core.startup.NbInstaller.loadLayers(java.util.List, boolean)
| | | | | | | +- 0.290 (1%) org.netbeans.core.startup.Main.initUICustomizations()
| | | | | | | +- 0.050 (0%) org.netbeans.core.startup.MainLookup.modulesClassPathInitialized()
| | | | | | | +- 0.010 (0%) org.netbeans.core.startup.CoreBridge.getDefault()
| | | | | | | +- 0.010 (0%) org.netbeans.core.startup.NbInstaller.checkForDeprecations(java.util.List)
| | | | | | +- 0.160 (1%) org.netbeans.StandardModule.classLoaderUp(java.util.Set)
| | | | | | +- 0.060 (0%) org.netbeans.core.startup.NbInstaller.classLoaderUp(java.lang.ClassLoader)
| | | | | | +- 0.050 (0%) org.netbeans.core.startup.NbInstaller.prepare(org.netbeans.Module)
| | | | | | +- 0.040 (0%) org.netbeans.ModuleManager.simulateEnable(java.util.Set)
| | | | | | +- 0.010 (0%) org.netbeans.ProxyClassLoader.append(java.lang.ClassLoader)
| | | | | +- 0.060 (0%) org.netbeans.ModuleManager.simulateEnable(java.util.Set)
| | | | +- 0.010 (0%) org.netbeans.core.startup.ModuleList.flushInitial()
You can select any function and navigate to its source, byte code, machine instructions (in "Machine" mode).
There are tabs that show Threads (how long each thread worked), Lines (hot lines), PCs (hot addresses).
You can see all threads in the Timeline tab - it shows what each thread did at the moment it was profiled.
You can also use HW counters to find cache misses and other information.
Herlihy and Shavit's book seems to be a good reference.
There is a lot of documentation. You can start from this page:
There is a very good demo created by Marty Itzkowitz:
Can I run my Java programs in specialized multi-core environments like the Google App Engine and analyze the execution dump on my Ubuntu desktop, which has the studio? Hope I am not on the wrong track here. I am also looking for some advice about how to go about this.
If this environment uses many machines, then I don't know if "collect" can handle it properly. The main problem is that it has to save the results in some directory. If this directory is accessible from all these machines, then everything may work just fine, but we did not test this case.
Please let us know if you have any problems with Java profiling.
I was able to profile my Java program using the thread analyzer. More specifically, I was using the fork/join framework and I was looking for data to understand how a Java thread gets allocated to a particular core. This is just to understand and view how multiple cores are utilized when a work-stealing algorithm runs.
Are there any specific recommendations for this?
You can see threads in the Timeline tab, and you can also switch from threads to CPUs in this tab.
If you are interested in one thread (how it migrated from one CPU to another), you can open the Threads tab, select this thread, and set a filter to exclude the other threads (using the context menu).
Then you can see, in the Timeline tab and in the CPUs tab, how this thread migrated from one CPU to another and which CPUs it used.
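For anyone who wants a small fork/join workload to look at in the Threads and Timeline tabs, here is a minimal sketch. It is not taken from the posts above: the class name ForkJoinSumDemo, the THRESHOLD value, and the array size are assumptions chosen only for illustration. It sums an array with a RecursiveTask and records which pool worker threads ran leaf tasks. Java has no standard API that reports the CPU a thread is running on, so the thread-to-CPU view still has to come from the profiler.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical demo class; all names and sizes here are illustrative assumptions.
class ForkJoinSumDemo {
    // Ranges of at most this many elements are summed directly instead of being split.
    static final int THRESHOLD = 10_000;

    // Names of the worker threads that executed at least one leaf task.
    static final Set<String> WORKERS =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    static class SumTask extends RecursiveTask<Long> {
        private final long[] data;
        private final int from;
        private final int to;

        SumTask(long[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {
                // Leaf task: remember which worker ran it, then sum sequentially.
                WORKERS.add(Thread.currentThread().getName());
                long sum = 0;
                for (int i = from; i < to; i++) {
                    sum += data[i];
                }
                return sum;
            }
            int mid = (from + to) >>> 1;
            SumTask left = new SumTask(data, from, mid);
            SumTask right = new SumTask(data, mid, to);
            left.fork();                      // queued; an idle worker may steal it
            long rightSum = right.compute();  // this worker handles the right half itself
            return left.join() + rightSum;
        }
    }

    // Sums the array on a dedicated pool with the given number of workers.
    static long parallelSum(long[] data, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new SumTask(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] data = new long[4_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        long sum = parallelSum(data, Runtime.getRuntime().availableProcessors());
        System.out.println("sum = " + sum);
        System.out.println("worker threads that ran leaf tasks: " + WORKERS);
    }
}
```

It could be run under the profiler in the way described earlier in the thread, for example "collect -j on java ForkJoinSumDemo", and the printed worker names can then be compared with the threads listed in the Threads tab.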
Thanks. I think this project on the cloud, http://developers.sun.com/speedway/, provides the studio tools, but it seems to be on the back burner. This seems to be a way of using multi-core machines on the cloud for testing.
Is there documentation to understand the effect on CPU caches using the studio ? I am basically just beginning to understand these concepts. Any test programs that demonstrate effects on caches ? I understand these are specific to the hardware and software we execute.
Mohan wrote:
Is there documentation to understand the effect on CPU caches using the studio ?
Yes, you can take a look at "man collect" - it has a section about HWC profiling.
Also, there are some documents available on the web, for example this one:
Basically, what you need is to find out which HW counters are available on your system.
You can use "collect" without arguments to find this information.
After that you can decide which of them are important for you:
L1 cache misses
L2 cache misses
L3 cache misses
On some systems you can profile all of them during one run.
On other systems you can profile some of them, and then rerun your application to profile other counters.
I think I can easily create a test that shows some cache problems, like false sharing, but it is more interesting
to see what kind of problems your application has. Anyway, I'll post a simple example of a cache problem soon.
Edited by: NikMolchanov on Nov 12, 2011 12:11 AM
Much appreciated. The PPT is good but also quite advanced from the perspective of a Java developer. I am not sure whether your examples can show false sharing in a Java thread program instead of C.
I have read page 476 of 'The Art of Multiprocessor Programming' by Herlihy and Shavit, where there is a single paragraph on false sharing. Maybe profiling the example on page 152 using the analyzer would be good proof for a beginner?
I am trying to demonstrate the utility of profiling on multiple cores, but I don't have any real application.
Edited by: Mohan on Nov 14, 2011 7:46 AM
Edited by: Mohan on Nov 14, 2011 7:48 AM
Yes, you are right, I was going to post a simple C application with 3-4 POSIX threads to illustrate a false sharing problem.
But it is interesting to illustrate it in a Java program as well. I'll try and let you know.
I think I have an idea about Java code to induce false sharing. But the thread analyzer seems to require code instrumentation. Can I view thread IDs, VAs and cache lines as shown on pages 35 and 36 of http://cscads.rice.edu/workshops/summer09/slides/performance-tools/DProfile.cscads.pdf ? An example would surely help.
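As one possible starting point, here is a minimal sketch of a Java program that may induce false sharing. It is not code from anyone in this thread, and the names (FalseSharingDemo, Worker, run) are hypothetical. It assumes cache lines of at most 128 bytes and relies on array elements being laid out contiguously: two threads each update their own slot of an AtomicLongArray, first in neighbouring slots (likely on the same cache line) and then in slots 128 bytes apart. Timings depend heavily on the hardware and the JVM, so the difference is something to measure, not a guaranteed result.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical demo class; names, slot indices and iteration counts are assumptions.
class FalseSharingDemo {

    // Repeatedly increments a single slot. Each slot has exactly one writer,
    // so the get-then-set pair cannot lose updates.
    static class Worker implements Runnable {
        private final AtomicLongArray slots;
        private final int index;
        private final long iterations;

        Worker(AtomicLongArray slots, int index, long iterations) {
            this.slots = slots;
            this.index = index;
            this.iterations = iterations;
        }

        @Override
        public void run() {
            for (long i = 0; i < iterations; i++) {
                // Volatile write: the cache line holding this slot must be owned by this core.
                slots.set(index, slots.get(index) + 1);
            }
        }
    }

    // Runs two threads on slots idxA and idxB and returns the elapsed time in nanoseconds.
    static long run(AtomicLongArray slots, int idxA, int idxB, long iterations) {
        Thread a = new Thread(new Worker(slots, idxA, iterations), "counter-A");
        Thread b = new Thread(new Worker(slots, idxB, iterations), "counter-B");
        long start = System.nanoTime();
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long iterations = 50_000_000L;
        AtomicLongArray slots = new AtomicLongArray(32);

        // Slots 0 and 1 are 8 bytes apart, so they usually share a cache line.
        long adjacent = run(slots, 0, 1, iterations);
        // Slots 2 and 18 are 128 bytes apart, so they should be on different lines
        // (assumption: cache lines are no larger than 128 bytes).
        long apart = run(slots, 2, 18, iterations);

        System.out.println("adjacent slots: " + adjacent / 1_000_000 + " ms");
        System.out.println("distant slots:  " + apart / 1_000_000 + " ms");
    }
}
```

The thread names "counter-A" and "counter-B" are set only so that the two workers are easier to pick out in a profile. Whether the analyzer can map these accesses to cache lines for a Java program is exactly the open question above; this sketch only provides a workload to try it on.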