12 Replies Latest reply on Sep 10, 2009 9:22 PM by 843789

    huge XML file

    843789
      Hello,
      I have a program that must store a lot of words, and information about them, in a file. I chose XML as the output format, but there are so many words
      that the file is 10MB. Below is an example of a row (there are 80,000 like it). Is there a way, staying with XML, to save space? (I don't know, maybe by changing something inside the XML.)
      <word hampost="2.7340257709269167E-6" spampost="2.255630617930008E-6" hamFreq="1" spamFreq="0"> ferreirinho </word>
      thanks,
        • 1. Re: huge XML file
          Tolls
          Zip it up?

          XML is text so, short of reducing the element and attribute names as much as possible (not something I'd really recommend), zipping is probably your best bet.
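
A minimal sketch of the zipping idea in Java, assuming gzip is acceptable to the consuming application. The class name `GzipXmlSketch`, the `<words>` root element and the helper names are made up for illustration; only the `<word>` row comes from the original post. The reader decompresses on the fly while parsing, so no unzipped copy is ever written to disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class GzipXmlSketch {

    /** Compresses the XML text with gzip and returns the compressed bytes. */
    static byte[] gzip(String xml) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    /** Counts <word> elements, decompressing the stream while it is parsed. */
    static int countWords(byte[] gzipped) throws IOException, XMLStreamException {
        int count = 0;
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "word".equals(reader.getLocalName())) {
                    count++;
                }
            }
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document: the <words> root element is an assumption.
        StringBuilder xml = new StringBuilder("<words>\n");
        for (int i = 0; i < 1000; i++) {
            xml.append("<word hampost=\"2.7340257709269167E-6\" spampost=\"2.255630617930008E-6\"")
               .append(" hamFreq=\"1\" spamFreq=\"0\"> ferreirinho </word>\n");
        }
        xml.append("</words>\n");

        byte[] compressed = gzip(xml.toString());
        System.out.println("plain bytes:   " + xml.length());
        System.out.println("gzipped bytes: " + compressed.length);
        System.out.println("words read:    " + countWords(compressed));
    }
}
```

Repeating one identical row, as this demo does, compresses far better than 80,000 distinct words would, so the ratio it prints is not a prediction for the real file.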
          • 2. Re: huge XML file
            843789
            If space is a concern, maybe XML isn't the right format. It was never meant to be brief. There are binary XML formats around that might help. Google that term: "binary XML"
            • 3. Re: huge XML file
              800285
              You could make your tag names shorter ;-)
              <w hp="2.7340257709269167E-6" sp="2.255630617930008E-6" hf="1" sf="0"> ferreirinho </w>
              But if you're looking to shrink the size of your output, it's probably better NOT to go with XML at all. XML is good because it's human readable--NOT because it's efficient with computer resources.

              Edited by: mangst on Sep 10, 2009 7:16 AM
              • 4. Re: huge XML file
                843789
                Of course, the real question here is, have you actually experienced any problems with such a huge file?
                • 5. Re: huge XML file
                  843789
                  I have no problem with that file at all; I just don't like it being so big, and it must be read by another application. So do you suggest zipping it, and unzipping it while reading?

                  I was also wondering whether it is a good idea to keep the doubles at such precision (maybe truncate them?).

                  Edited by: mickey0 on Sep 10, 2009 7:28 AM
                  • 6. Re: huge XML file
                    843789
                    mickey0 wrote:
                    I have no problem with that file at all; I just don't like it being so big, and it must be read by another application.
                    Not really a problem, then, is it? Seriously, I wouldn't worry about it unless it's actually causing you a problem. What's this other app? Can you easily change it?
                    I was also wondering whether it is a good idea to keep the doubles at such precision (maybe truncate them?).
                    Depends entirely on your needs. It doesn't sound like a good idea, truncating data just to fit it into something, but that's your call.
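
To make the precision trade-off concrete, here is a small sketch that writes a value with fewer significant digits and measures what is lost. The class and method names (`PrecisionSketch`, `shorten`) and the choice of six significant digits are assumptions for illustration; whether that loss is acceptable depends on how the consuming application uses the numbers.

```java
import java.util.Locale;

public class PrecisionSketch {

    /** Formats a value in scientific notation with the given number of significant digits. */
    static String shorten(double value, int significantDigits) {
        // "%.5E" means one digit before the decimal point and five after it.
        return String.format(Locale.ROOT, "%." + (significantDigits - 1) + "E", value);
    }

    public static void main(String[] args) {
        double hampost = 2.7340257709269167E-6; // value from the example row
        String fullText = Double.toString(hampost);
        String shortText = shorten(hampost, 6);
        double readBack = Double.parseDouble(shortText);

        System.out.println(fullText + " (" + fullText.length() + " chars)");
        System.out.println(shortText + " (" + shortText.length() + " chars)");
        System.out.println("relative error: " + Math.abs(readBack - hampost) / hampost);
    }
}
```

This rounds rather than truncates, and the shortened text still parses with `Double.parseDouble`, so the reading side would not need to change.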
                    • 7. Re: huge XML file
                      843789
                      It seems to me that you would use XML with either a DTD or schema to enforce and guarantee that the content conforms to a particular format. If the file is just a bunch of words with no well-defined structure that you want to enforce, you shouldn't send it as XML.
                      • 8. Re: huge XML file
                        800285
                        georgemc wrote:
                        mickey0 wrote:
                        I have no problem with that file at all; I just don't like it being so big, and it must be read by another application.
                        Not really a problem, then, is it? Seriously, I wouldn't worry about it unless it's actually causing you a problem. What's this other app? Can you easily change it?
                        I was also wondering whether it is a good idea to keep the doubles at such precision (maybe truncate them?).
                        Depends entirely on your needs. It doesn't sound like a good idea, truncating data just to fit it into something, but that's your call.
                        I think he maybe means that he has no trouble parsing the data, but the size is an inconvenience. Do you have to transmit the data over a network connection of some sort? If so, then yes, it would probably be beneficial to make the data as small as possible.
                        • 9. Re: huge XML file
                          843789
                          mangst wrote:
                          I think he maybe means that he has no trouble parsing the data, but the size is an inconvenience.
                          I don't see why you think that. He's only said he simply doesn't like it being so big.
                          Do you have to transmit the data over a network connection of some sort? If so, then yes, it would probably be beneficial to make the data as small as possible.
                          I disagree, to an extent. That would be true only if the overhead of transport was prohibitive. It depends on what's going on. For example, if you're shunting 50MB once an hour to another machine, to stick it in a database for future reference, there's little to be gained from optimising the transport. Without context it's impossible to say "you *must* do so-and-so", and the only context so far is that the OP has an irrational dislike for verbose data that otherwise works fine. Besides, several suggestions along these lines have been made, to no avail, leaving me wondering what our OP is actually expecting.
                          • 10. Re: huge XML file
                            800285
                            georgemc wrote:
                            For example, if you're shunting 50MB once an hour to another machine, to stick it in a database for future reference, there's little to be gained from optimising the transport.
                            I disagree, good sir. What if the source machine is connected to the destination machine through, say, a 56K modem connection (highly unlikely, but nevertheless possible)? You must also take into account the speed of the network connection.
                            • 11. Re: huge XML file
                              843789
                              mangst wrote:
                              georgemc wrote:
                              For example, if you're shunting 50MB once an hour to another machine, to stick it in a database for future reference, there's little to be gained from optimising the transport.
                              I disagree, good sir. What if the source machine is connected to the destination machine through, say, a 56K modem connection (highly unlikely, but nevertheless possible)? You must also take into account the speed of the network connection.
                              Exactly. You're not disagreeing with me at all, you're expanding on my point. Adding "what-ifs" after the fact is giving more context than we originally had, but even in this contrived scenario, you still don't have a conclusive "we must optimise". If the current transfer is taking so long that the consuming application suffers, or there is undue cost involved in it taking so long, then you certainly should change. If it doesn't matter that the transfer is slow, then what gain is there? Again, the OP has not said anything other than "I don't like the size of the file", which isn't much basis for spending more development effort.
                              • 12. Re: huge XML file
                                843789
                                mickey0 wrote:
                                Hello,
                                I have a program that must store a lot of words, and information about them, in a file. I chose XML as the output format, but there are so many words
                                that the file is 10MB. Below is an example of a row (there are 80,000 like it). Is there a way, staying with XML, to save space? (I don't know, maybe by changing something inside the XML.)
                                <word hampost="2.7340257709269167E-6" spampost="2.255630617930008E-6" hamFreq="1" spamFreq="0"> ferreirinho </word>
                                thanks,
                                The overhead of XML in this case is only a bit over 50%; roughly half of the storage is spent on XML markup (tags, attribute names, quotes). That's not that much... it's not like it inflates the size by an order of magnitude.

                                You use 115 bytes in that example. About 65 of them are XML overhead, but then a further 20 or so are sunk in using a text format at all. You made the choice to use a human-readable format, but if you abandon that, you could shrink the size by at least 70%. And probably make it much faster to read.

                                If that matters. But when you're storing data like this, I assume you're going to want to be able to read it in an ad-hoc manner later on.
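
As a rough check on those numbers, here is a sketch that encodes the same example row as a binary record with `DataOutputStream`. The record layout (word, two doubles, two ints) and the names `BinaryRecordSketch`, `encode` and `decodeWord` are assumptions for illustration, not a format anyone in this thread proposed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class BinaryRecordSketch {

    /** Encodes one word record in a hypothetical fixed field order. */
    static byte[] encode(String word, double hampost, double spampost,
                         int hamFreq, int spamFreq) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(word);        // 2-byte length prefix + modified UTF-8 bytes
            out.writeDouble(hampost);  // 8 bytes
            out.writeDouble(spampost); // 8 bytes
            out.writeInt(hamFreq);     // 4 bytes
            out.writeInt(spamFreq);    // 4 bytes
        }
        return buffer.toByteArray();
    }

    /** Reads back only the word, to show that the record round-trips. */
    static String decodeWord(byte[] record) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(record))) {
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        String xmlRow = "<word hampost=\"2.7340257709269167E-6\" spampost=\"2.255630617930008E-6\""
                + " hamFreq=\"1\" spamFreq=\"0\"> ferreirinho </word>";
        byte[] record = encode("ferreirinho", 2.7340257709269167E-6, 2.255630617930008E-6, 1, 0);

        int xmlBytes = xmlRow.getBytes(StandardCharsets.UTF_8).length;
        System.out.println("XML row bytes:    " + xmlBytes);
        System.out.println("binary row bytes: " + record.length);
        System.out.println("word read back:   " + decodeWord(record));
    }
}
```

With this layout the record is 37 bytes against 115 for the XML row, a reduction of about two thirds; storing the frequencies in a narrower type would get closer to the 70% figure. The cost is exactly the one noted above: the file can no longer be inspected with a text editor.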