We have developed a Java application whose primary objective is to read an input file, process it, and convert it into a set of output files.
(I have given a generic description of our solution to avoid irrelevant details.)
Unfortunately, your generic description has also avoided 'relevant' details - such as what you are actually doing to 'process' the file.
Can you please suggest ways to avoid OutOfMemory exceptions? What is the general practice when developing these kinds of applications?
General practice is to architect solutions that are scalable as well as performant.
Based only on what you posted, my first thought is that your 'solution' involves reading the entire file into memory. If so then that solution, as your out-of-memory issue probably indicates, doesn't scale. There will always be a file too big to read into memory.
But without knowing what steps you are now using to 'process' one source file and produce one or multiple output files no specific suggestions are possible.
Even an enterprise-class database like Oracle doesn't process large amounts of data like that strictly in memory and that database can handle terabytes of data.
Post a description, in English, of what the process needs to be and we can help you find a more appropriate architecture.
Thanks for your reply.
The solution involves loading relevant information into memory (the entire 130GB file is not loaded).
I'll try to be more specific with my problem here.
We are converting an OSM file (the OSM file of North America, file size 130GB; this is an XML file which contains structured map data).
The relevant parts of the file are read into memory to build a list of nodes, ways, edges and an adjacency list. As I said, this solution works for the Minnesota dataset (a 4GB file), but for the larger files the application breaks. The application uses HashMaps for lookups to generate the ways, edges and adjacency list.
You still seem to be rather stingy with the specifics.
Helping you isn't going to be very efficient if we have to keep 'guessing' as to what you might be doing.
An XML file can be processed in a lot of ways. Two common ways to read the 'relevant parts' of an XML file into memory are: 1) use a DOM parser and 2) use a SAX parser. The DOM parser will be VERY memory and resource intensive for a large file, while the SAX parser will be relatively scalable.
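To illustrate the difference, here is a minimal sketch of a streaming parse using StAX from the JDK (a pull parser that, like SAX, visits one event at a time instead of materializing the whole tree). The element names and the tiny inline document are made up for illustration; a DOM parse of a 130GB file would instead try to hold every element in memory at once.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingParseSketch {

    // Count <node> elements without ever building a document tree;
    // memory use stays flat regardless of input size.
    static int countNodes(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "node".equals(reader.getLocalName())) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in for a large OSM-style XML file.
        String xml = "<osm>"
                + "<node id=\"1\" lat=\"44.9\" lon=\"-93.2\"/>"
                + "<node id=\"2\" lat=\"45.0\" lon=\"-93.3\"/>"
                + "<way id=\"10\"/>"
                + "</osm>";
        System.out.println(countNodes(xml)); // prints 2
    }
}
```

With a streaming parser the program only keeps what it explicitly stores, which is why the remaining memory growth usually comes from the application's own data structures rather than from the parser.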
The limited symptoms you describe match those of a DOM issue with larger files. They also match those of an app that just keeps creating lots and lots of objects and never clearing the references to them. Profile the app and see if the number of objects just keeps growing and growing.
But not having a clue as to whether you are even using either of those makes it impossible to help further or even direct you to any documentation, tutorials or examples that might be of help.
Have you reviewed information available on sites such as MATLAB Central?
This software package includes functions for working with OpenStreetMap XML Data files (extension .osm), as downloaded from http://www.openstreetmap.org, to:
1) Import and parse the XML data file and store the parsed data in a MATLAB structure. This data represents the graph of the transportation network.
2) Plot the MATLAB structure to get a visualization of the transportation network, its nodes and their labels.
3) Extract the adjacency matrix of the directed graph representing the network's connectivity (i.e., road intersections).
4) Find shortest routes between nodes within the network. Note that distance is measured as the number of transitions between intersection nodes, not over the map.
Development on GitHub:
and releases there and here.
An example map.osm is included to be used with usage_example.m
I'll try to give a brief description of the problem.
1. The program reads the OSM (XML) file and parses it using Java StAX (this is a scalable streaming parser).
2. While reading the file, the program populates a HashMap with id as key and Landmark as value (a Landmark contains latitude, longitude, and id), and another HashMap for streets, with street id as key and references to the landmarks along the street as the value.
3. Using the above two HashMaps, a WeightedPseudograph (from the JGraphT package) is constructed. This graph is used to check the connectivity of the graph, to remove bogus edges, etc.
4. At this stage the program fails with an OutOfMemory exception.
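A minimal sketch of steps 2 and 3, assuming the Landmark fields given above; plain collections stand in for the JGraphT WeightedPseudograph, and the way edges are derived from street ordering is an assumption for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GraphBuildSketch {

    // Fields taken from the description: latitude, longitude, and id.
    record Landmark(long id, double lat, double lon) {}

    // Stand-in for step 3: turn each street's ordered landmark ids into
    // consecutive edge pairs (a WeightedPseudograph would hold these instead).
    static List<long[]> buildEdges(Map<Long, List<Long>> streets) {
        List<long[]> edges = new ArrayList<>();
        for (List<Long> along : streets.values()) {
            for (int i = 0; i + 1 < along.size(); i++) {
                edges.add(new long[] { along.get(i), along.get(i + 1) });
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        // Step 2: id -> Landmark, and street id -> landmark ids along it.
        Map<Long, Landmark> landmarks = new HashMap<>();
        landmarks.put(1L, new Landmark(1, 44.9, -93.2));
        landmarks.put(2L, new Landmark(2, 45.0, -93.3));

        Map<Long, List<Long>> streets = new HashMap<>();
        streets.put(10L, List.of(1L, 2L));

        System.out.println(buildEdges(streets).size()); // prints 1
    }
}
```

Note that in a structure like this every node and edge lives in memory for the whole run, so the footprint grows with the size of the input file even though the parser itself streams.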
Thanks - now we're making some progress!
The above suggests that troubleshooting would start by instrumenting the code to provide some basic metrics and info for each of the first three steps.
So use a profiler (or write some simple Java code) to determine the memory consumption and number of objects for each step.
Alter the code so that you can stop after each step if you wish:
1. Does this step succeed?
2. How much memory was used? (Capture the memory before the step begins as well.)
3. How many objects were created?
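A crude way to capture those numbers without a profiler is to read the JVM's own heap counters before and after each step. The step body below is a hypothetical stand-in for one of your real steps (populating a landmark-style map):

```java
import java.util.HashMap;
import java.util.Map;

public class StepMetrics {

    // Approximate used heap. System.gc() is only a hint to the JVM,
    // but it steadies the reading between measurements.
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Hypothetical stand-in for one step of the real program.
    static Map<Long, double[]> runStep(int entries) {
        Map<Long, double[]> m = new HashMap<>();
        for (long id = 0; id < entries; id++) {
            m.put(id, new double[] { 44.9, -93.2 });
        }
        return m;
    }

    public static void main(String[] args) {
        long before = usedBytes();               // capture memory before the step
        Map<Long, double[]> result = runStep(100_000);
        long after = usedBytes();                // and again after it completes
        System.out.printf("step created %d entries using ~%d KB%n",
                result.size(), (after - before) / 1024);
    }
}
```

Printing a line like this after each of the three steps tells you which one actually blows up, instead of guessing.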
If the amount of memory and the number of objects for the first two steps seem reasonable then, as you suspect, it may be step 3 that is causing the issue.
I suggest that you try to determine the approximate growth curve of the memory being used for each step. Check the memory at each step using a VERY small file, then a bigger file, then your big file that works and then your BIG file that doesn't work. What happens to memory use when the number of objects/edges/etc is doubled or tripled? Does there seem to be any correlation between the size of the file, the number of nodes and the amount of memory?
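One way to sketch that growth-curve measurement is to double a synthetic input each round and watch how the used-heap delta changes. The map population below is a stand-in for your real parse-and-load step:

```java
import java.util.HashMap;
import java.util.Map;

public class GrowthCurveSketch {

    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // a hint only, but reduces measurement noise
        return rt.totalMemory() - rt.freeMemory();
    }

    // Stand-in for loading a file with n nodes.
    static Map<Long, double[]> load(int n) {
        Map<Long, double[]> m = new HashMap<>();
        for (long id = 0; id < n; id++) {
            m.put(id, new double[] { 0.0, 0.0 });
        }
        return m;
    }

    public static void main(String[] args) {
        // Double the input each round. Roughly linear growth in the printed
        // deltas suggests memory simply scales with file size; super-linear
        // growth points at the algorithm or data structure instead.
        for (int n = 50_000; n <= 400_000; n *= 2) {
            long before = usedBytes();
            Map<Long, double[]> data = load(n);
            long after = usedBytes();
            System.out.printf("%d entries -> ~%d KB%n",
                    data.size(), (after - before) / 1024);
        }
    }
}
```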
As Wizzle suggested it may be the algorithm being used. It could be creating a lot of unnecessary objects or it might not be releasing those objects after the algorithm has completed working with them.
You should also post on any JGraphT forums, or on that library's home site/wiki, if you haven't done so already. In particular, ask if anyone using that library has (successfully) worked with files/objects as large as what you are working with.
You could also explore any JGraphT specific configuration parameters that might be available.
It is also possible that the library code has managed to get into a circular reference loop that it can never get out of.
You can toss any kind of complex algorithm and data structure at this to get your money's worth out of what you have, but in the end it will boil down to basic logistics. There is only so much you can stuff into one storage container. When the time comes that you need to store more stuff and the container is overflowing even with the most efficient storage strategy, you need to rent another container.
Thanks a lot for all your prompt replies.
We have decided to investigate Redis. This will solve the problem described in the post. All the hashmaps can be put into the database (secondary memory) and the OutOfMemory exceptions can be avoided.
Well - good luck with that.
Please post the reasoning that led you to that conclusion.
Based on what you have posted so far you haven't determined where the problem actually occurs. The most likely source is NOT the storage of the hashmaps but the algorithm being used by that 'other' code.
The most useful and productive thing you could do is troubleshoot, as suggested earlier, to determine just WHAT is causing the problem.