10 Replies. Latest reply: Feb 10, 2011 2:36 PM by EJP

GZIP decompression of chunked data?

837309 Newbie
I'm trying to decompress a chunked stream of GZIP-compressed data, but I don't know how to solve this without grossly inefficient workarounds.

The data is coming from a web server, and is sent chunked. This means that before each chunk, the size of the chunk is announced in plain text (as a hexadecimal number), and a size of 0 terminates the body.

Simply wrapping the socket stream with the GZIPInputStream, like in the examples, only works if the stream is entirely GZIP, but this is not the case here.

I have to repeatedly do a readLine on the input stream to get the length of the chunk, and then I need to send that amount of bytes from the input stream to the GZIP decompressor. I'm stuck here as I don't know a way to 'send' selected bytes to the decompressor. I only know how to create the decompressor by wrapping an existing input stream, and simply by creating said decompressor it consumes bytes from the input stream to verify if it is a GZIP compressed stream.

The only thing I can come up with is to store the entire compressed data in a huge String, wrap it in a custom-made InputStream subclass that streams bytes from my String, and wrap that in the decompressor. Is this really the only way?
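
For the raw-socket approach, one way to hand only the chunk payload to the decompressor is a small stream wrapper that strips the chunk framing on the fly, so nothing has to be collected in a String first. The following is a minimal sketch under stated assumptions: DechunkingInputStream is a hypothetical helper name (not a JDK class), the HTTP header has already been read from the same stream, and trailer fields are simply skipped.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not a JDK class): strips HTTP/1.1 chunk framing from a body stream. */
class DechunkingInputStream extends InputStream {
    private final InputStream in;
    private int remaining = 0;         // payload bytes left in the current chunk
    private boolean firstChunk = true;
    private boolean finished = false;  // set once the zero-size chunk has been read

    DechunkingInputStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        return read(one, 0, 1) == -1 ? -1 : (one[0] & 0xFF);
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (finished) return -1;
        if (remaining == 0 && !nextChunk()) return -1;
        // Never read past the end of the current chunk.
        int n = in.read(buf, off, Math.min(len, remaining));
        if (n == -1) throw new EOFException("stream ended inside a chunk");
        remaining -= n;
        return n;
    }

    /** Reads the next chunk-size line; returns false after the terminating chunk. */
    private boolean nextChunk() throws IOException {
        if (!firstChunk) readLine();           // CRLF that closes the previous chunk's data
        firstChunk = false;
        String sizeLine = readLine();
        int semi = sizeLine.indexOf(';');      // ignore optional chunk extensions
        if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
        try {
            remaining = Integer.parseInt(sizeLine.trim(), 16);  // sizes are hexadecimal
            if (remaining < 0) throw new NumberFormatException();
        } catch (NumberFormatException e) {
            throw new IOException("bad chunk size: " + sizeLine);
        }
        if (remaining > 0) return true;
        while (!readLine().isEmpty()) { }      // skip optional trailer fields
        finished = true;
        return false;
    }

    /** Reads one line of chunk framing, without the line terminator. */
    private String readLine() throws IOException {
        StringBuilder line = new StringBuilder();
        for (int c; (c = in.read()) != '\n'; ) {
            if (c == -1) throw new EOFException("stream ended inside chunk framing");
            if (c != '\r') line.append((char) c);
        }
        return line.toString();
    }
}
```

With this in place, the decompressor can be created as new GZIPInputStream(new DechunkingInputStream(inputStream)), where inputStream is the same buffered stream the header was read from. It only applies when every chunk belongs to one continuous GZIP stream, which is the usual case for Content-Encoding: gzip.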

For example, a webpage like this sends its data chunked and GZIP-compressed: http://www.anidb.net

Edited by: 834306 on Feb 6, 2011 11:16 AM

Edited by: 834306 on Feb 7, 2011 4:54 AM
  • 1. Re: GZIP decompression of chunked data?
    sabre150 Expert
    I thought that the InputStream from a chunked HttpURLConnection has already had the chunk information decoded behind the scenes so that one does not have to do any de-chunking of the response.
  • 2. Re: GZIP decompression of chunked data?
    EJP Guru
    I think he means that the chunk is zipped individually. For example maybe not all the chunks are zipped.

    OTOH it might be expected that URLConnection could unzip as well given that it knows the content of the chunk is zipped.
  • 3. Re: GZIP decompression of chunked data?
    sabre150 Expert
    EJP wrote:
    I think he means that the chunk is zipped individually. For example maybe not all the chunks are zipped.

    OTOH it might be expected that URLConnection could unzip as well given that it knows the content of the chunk is zipped.
    I was confused when I read the OP because I could not see which operations the OP was performing manually and which operations the server was performing behind the scenes. It seemed to me that the OP was getting the server to provide a chunked response and then expected that he had to perform the de-chunking himself. What added to my confusion was that the OP talked of putting the compressed data in a String, which is obviously a no-no.

    If the OP is talking of using standard chunking and assumes that he has to do the de-chunking manually then, if using a chunked URLConnection, he doesn't and he can read the de-chunked data without being aware that it was ever chunked.
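
The behaviour described above can be tried without any external site. The following is a minimal sketch, assuming the JDK's built-in com.sun.net.httpserver package is available; the class name GzipOverChunkedDemo and its methods are made up for illustration. The throwaway local server sends a chunked, gzip-encoded body, and the client never sees the chunk framing but still has to undo the gzip encoding itself.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative names only; none of this comes from the original posts. */
class GzipOverChunkedDemo {

    /** Starts a throwaway local server that answers with a chunked, gzip-encoded body. */
    static HttpServer startServer(String text) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            exchange.getResponseHeaders().set("Content-Encoding", "gzip");
            // A response length of 0 tells this server to use chunked transfer encoding.
            exchange.sendResponseHeaders(200, 0);
            try (OutputStream out = new GZIPOutputStream(exchange.getResponseBody())) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
        });
        server.start();
        return server;
    }

    /** Fetches the URL and returns the decoded body as text. */
    static String fetch(URL url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        try {
            // The chunk framing has already been removed from this stream.
            InputStream in = connection.getInputStream();
            // The content encoding is not undone automatically; the null-safe
            // comparison also covers responses without a Content-Encoding header.
            if ("gzip".equalsIgnoreCase(connection.getContentEncoding())) {
                in = new GZIPInputStream(in);
            }
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) {
                body.write(buf, 0, n);
            }
            return body.toString("UTF-8");
        } finally {
            connection.disconnect();
        }
    }
}
```

Calling startServer("some text"), fetching http://127.0.0.1:<port>/ with fetch, and then stopping the server with server.stop(0) exercises the whole round trip on one machine.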

    If the server is responding with a mixed bag of data that is chunked then I would expect that each bag item had a means to identify it and how big it is. The client then still does not need to worry about the de-chunking.

    If the OP wants to write his own chunking with each chunk being zipped individually then this would be treated like any other sequential protocol; at a minimum each chunk will have an identifier and a length.

    I am still confused. I hope the OP will elaborate.
  • 4. Re: GZIP decompression of chunked data?
    837309 Newbie
    Thanks for your responses.

    I indeed thought I had to de-chunk manually. I'm fairly new to all this networking stuff. I've never even heard of a HttpURLConnection.

    What I did before was opening a Socket, sending the manually-constructed request header and then receiving and manually decoding the http header. Something like this (shortened for readability)
    socket = new Socket(url.getHost(), port);
    dataOutputStream = new DataOutputStream(socket.getOutputStream());
    inputStream = new BufferedInputStream(socket.getInputStream());
    // HTTP header lines end with CRLF ("\r\n"); an empty line ends the header
    String message = "GET " + url.getFile() + " HTTP/1.1" + //
              "\r\nHost: " + url.getHost() + //
              "\r\nUser-Agent: Mozilla/1.2 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.13)" + //
              "\r\nAccept: text/html,application/xml" + //
              "\r\nAccept-Language: en-us,en;q=0.5" + //
              "\r\nAccept-Charset: ISO-8859-1" + //
              "\r\nConnection: close\r\n\r\n";
    // encode explicitly instead of relying on the platform default charset
    dataOutputStream.write(message.getBytes("US-ASCII"));
    <here a lot of code to decode the header and put the field-value pairs in a HashMap>
    <here a lot of code to receive the body of the message, while taking into account the reported content-length and chunking, as well as sending updates to a ProgressListener>
    I tried HttpURLConnection and indeed it seems a lot easier than the way I did things. However, what I receive from the HttpURLConnection is de-chunked but still GZIP compressed. So it is not completely transparent as I hoped. I wrapped in a decoder and it works.

    Here's what I have now:
    HttpURLConnection connection = null;
    try
    {
         connection = (HttpURLConnection)url.openConnection();
         connection.addRequestProperty("Host", url.getHost());
         connection.addRequestProperty("User-Agent", "Mozilla/1.2 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.13)");
         connection.addRequestProperty("Accept", "text/html,application/xml");
         connection.addRequestProperty("Accept-Language", "en-us,en;q=0.5");
         connection.addRequestProperty("Accept-Charset", "ISO-8859-1");
         connection.addRequestProperty("Connection", "close");
    
         connection.setUseCaches(false);
         connection.setDoInput(true);
    
         InputStream inp = connection.getInputStream();
    
         // getHeaderField() returns null when the header is absent, so compare null-safely
         if ("gzip".equalsIgnoreCase(connection.getHeaderField("Content-Encoding")))
         {
              inp = new GZIPInputStream(inp);
         }
    
         int i;
         while ((i = inp.read()) != -1)
         {
              logout.write(i);
         }
         
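    }
    catch (IOException e)
    {
         e.printStackTrace(); // report the failure instead of silently dropping it
    }
    finally
    {
         // runs on success and on failure alike
         if (connection != null) connection.disconnect();
    }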
    Thanks a bunch.

    Edited by: 834306 on Feb 7, 2011 5:01 AM
  • 5. Re: GZIP decompression of chunked data?
    sabre150 Expert
    I don't see where you have used setChunkedStreamingMode() so I doubt if you are actually using chunked mode!
  • 6. Re: GZIP decompression of chunked data?
    837309 Newbie
    From what I read, setChunkedStreamingMode() only applies to the request, not the response.
  • 7. Re: GZIP decompression of chunked data?
    sabre150 Expert
    834306 wrote:
    From what I read, setChunkedStreamingMode() only applies to the request, not the response.
    You may be right; careful reading of the Javadoc implies that I am wrong. At the moment I can't find a definitive reference; just an indication that chunked mode can be in both directions but not what part of the HTTP protocol defines this or how it is achieved in Java. More research is required.

    EJP is likely to be able to throw light on this.
  • 8. Re: GZIP decompression of chunked data?
    EJP Guru
    That's correct. The sending end determines the transfer mode of what he is sending.
  • 9. Re: GZIP decompression of chunked data?
    802889 Explorer
    The server is the one deciding whether it sends chunked or not, and as long as the request indicated HTTP/1.1 the client should support (de-)chunking. On the other hand, if the request indicated HTTP/1.0 then the server is not allowed to chunk the content.
  • 10. Re: GZIP decompression of chunked data?
    EJP Guru
    The server is the one deciding if he sends chunked or not
    The sender is the one deciding whether he sends chunked or not. The client can send the request body chunked, and the server can send the response chunked. Both ends need to support de-chunking.
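
Both directions can be seen in one round trip. The following is a minimal sketch, again assuming the JDK's com.sun.net.httpserver package; ChunkedRequestDemo and its method names are invented for illustration. The client chunks its request body with setChunkedStreamingMode(), and the local server reports the Transfer-Encoding it received and sends its own reply chunked.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Illustrative names only; none of this comes from the original posts. */
class ChunkedRequestDemo {

    /** Local server replying with "<request Transfer-Encoding>|<request body>". */
    static HttpServer startEchoServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String mode = exchange.getRequestHeaders().getFirst("Transfer-Encoding");
            byte[] body = readAll(exchange.getRequestBody()); // server de-chunks the request
            exchange.sendResponseHeaders(200, 0);             // length 0: chunked response
            try (OutputStream out = exchange.getResponseBody()) {
                out.write((mode + "|").getBytes(StandardCharsets.UTF_8));
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    /** Sends text as a chunked POST body and returns the (de-chunked) reply. */
    static String post(URL url, String text) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        try {
            connection.setRequestMethod("POST");
            connection.setDoOutput(true);
            // Affects the request body only; 0 or less selects a default chunk size.
            connection.setChunkedStreamingMode(0);
            try (OutputStream out = connection.getOutputStream()) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            // The client de-chunks the response without any further setup.
            return new String(readAll(connection.getInputStream()), StandardCharsets.UTF_8);
        } finally {
            connection.disconnect();
        }
    }

    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) != -1; ) {
            all.write(buf, 0, n);
        }
        return all.toByteArray();
    }
}
```

Leaving out the setChunkedStreamingMode() call changes only how the request body is framed; the response is read the same way either way.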
