Skip to content

Conversation

@oalders
Copy link

@oalders oalders commented Jul 22, 2014

I was using a plain WWW::Mechanize object today and I was getting the following error:

Couldn't decode '��QO�˒�l0��(�b�E0C���Ʊ��c�k;�D�vwk�a�0N7�3�F$ƥ�o����y�6iAҳf��6�kWŜ��7���r�TX��
                                                                                                  �⎽���└3
                                                                                                         ⎽�땿Ӆ��┴@└�▒��]�F��%V⎽(FQG4▒R?b����X�N◆�8d����ڨ��─:c%ڑ��eJ�±BJ±┴e�#B�
                                                                                                                                                                              ⎺^��≤@�Ѿ�W▮I≤��┘$\]����4�#≤�С#6��2�↓�Z�'Ƥ<��±Mߟ£��]�H:': └▒┌°⎺⎼└ed JSON ⎽├⎼☃┼±← ┼e☃├▒e⎼ ├▒±← ▒⎼⎼▒≤← ⎺b┘ec├← ┼┤└be⎼← ⎽├⎼☃┼± ⎺⎼ ▒├⎺└←
        ▒├ Me├▒CPAN::C┌☃e┼├::Re─┤e⎽├::├⎼≤ π↓↓↓£ (c▒▒⎼▒c├e⎼ ⎺°°⎽e├ ▮ (be°⎺⎼e "\│π1°£\│π°°°d£\│π8£\│π▮£↓↓↓") ▒├ /U⎽e⎼⎽/⎺┌▒°/⎻e⎼┌5/┌☃b/⎻e⎼┌5/Me├▒CPAN/C┌☃e┼├/Re─┤e⎽├↓⎻└:113)
        ▒├ <e┴▒┌>(/U⎽e⎼⎽/⎺┌▒°/⎻e⎼┌5/┌☃b/⎻e⎼┌5/T⎼≤/T☃┼≤↓⎻└:81)
        ▒├ T⎼≤::T☃┼≤::├⎼≤(/U⎽e⎼⎽/⎺┌▒°/⎻e⎼┌5/┌☃b/⎻e⎼┌5/T⎼≤/T☃┼≤↓⎻└:72)
        ▒├ Me├▒CPAN::C┌☃e┼├::Re─┤e⎽├::_dec⎺de_⎼e⎽┤┌├(/U⎽e⎼⎽/⎺┌▒°/⎻e⎼┌5/┌☃b/⎻e⎼┌5/Me├▒CPAN/C┌☃e┼├/Re─┤e⎽├↓⎻└:114)
        ▒├ Me├▒CPAN::C┌☃e┼├::Re─┤e⎽├::°e├c▒(/U⎽e⎼⎽/⎺┌▒°/⎻e⎼┌5/┌☃b/⎻e⎼┌5/Me├▒CPAN/C┌☃e┼├/Re─┤e⎽├↓⎻└:62)
        ▒├ Me├▒CPAN::C┌☃e┼├::°e├c▒(<e┴▒┌>:12)
        ▒├ Me├▒CPAN::C┌☃e┼├::_±e├(/U⎽e⎼⎽/⎺┌▒°/⎻e⎼┌5/┌☃b/⎻e⎼┌5/Me├▒CPAN/C┌☃e┼├↓⎻└:167)
        ▒├ Me├▒CPAN::C┌☃e┼├::_±e├_⎺⎼_⎽e▒⎼c▒(/U⎽e⎼⎽/⎺┌▒°/⎻e⎼┌5/┌☃b/⎻e⎼┌5/Me├▒CPAN/C┌☃e┼├↓⎻└:2▮9)
        ▒├ Me├▒CPAN::C┌☃e┼├::▒┤├▒⎺⎼(/U⎽e⎼⎽/⎺┌▒°/⎻e⎼┌5/┌☃b/⎻e⎼┌5/Me├▒CPAN/C┌☃e┼├↓⎻└:44)
        ▒├ └▒☃┼::(e│▒└⎻┌e⎽/└e├c⎻▒┼↑c┌☃e┼├↓⎻┌:19)
 ▒├ /U⎽e⎼⎽/⎺┌▒°/⎻e⎼┌5/┌☃b/⎻e⎼┌5/Me├▒CPAN/C┌☃e┼├/Re─┤e⎽├↓⎻└ ┌☃┼e 114↓

Returning decoded_content() fixes that for me.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling c2ac44a on oalders:decoded-content into a1acb70 on kentnl:master.

@kentfredric
Copy link
Member

I'm not entirely sure decoding content arbitrarily is a good idea here. Maybe it could be a constructor argument at best, but even that I'm not sure about.

Because the HTTP::Tiny itself doesn't decode content and just passes it verbatim, requiring the consumer to decode correctly.

( Hence, why HTTP::Tiny::UA exists and why I presently have a feature request open for HTTP::Tiny::UA to support ->decoded_content dagolden/HTTP-Tiny-UA#3 )

So it seems what ever is consuming HTTP::Tiny::Mech and assuming it to work like HTTP::Tiny is consuming it incorrectly.

Basically, I'm saying it looks like a bug in the calling code, because if you had a real HTTP::Tiny there, you'd theoretically have the same problem.

@oalders
Copy link
Author

oalders commented Jul 22, 2014

Good info to have. I'll take a closer look at this to see what's going on in the calling code.

@oalders
Copy link
Author

oalders commented Jul 23, 2014

Here's an example of what I'm talking about. The calling code isn't doing anything special.

#!/usr/bin/env perl

use strict;
use warnings;

use Devel::SimpleTrace;
use HTTP::Tiny::Mech;
use MetaCPAN::Client;
use LWP::UserAgent;
use WWW::Mechanize;
use WWW::Mechanize::Cached;

my @ua_classes
    = ( 'WWW::Mechanize::Cached', 'LWP::UserAgent', 'WWW::Mechanize', );

foreach my $ua_class ( @ua_classes ) {
    print "$ua_class\n";

    my $ua = $ua_class->new;
    my $wrapped_ua = HTTP::Tiny::Mech->new( mechua => $ua );

    my $mcpan = MetaCPAN::Client->new( ua => $wrapped_ua );
    $mcpan->author( 'XSAWYERX' );
}

In this example the WWW::Mechanize object always fails for me with the decoding error I listed above. LWP::UserAgent and WWW::Mechanize::Cached are both fine.

@kentfredric
Copy link
Member

MetaCPAN::Client is naughty and quite a few calls will entirely ignore the passed ua. I'll have to look into this case and see whats happening here though.

@kentfredric
Copy link
Member

Presently having trouble even installing the dependencies as Cache::Cache is entirely broken on blead now:

http://matrix.cpantesters.org/?dist=Cache-Cache+1.06

$ cpanm --look Cache::Cache
--> Working on Cache::Cache
Fetching http://www.cpan.org/authors/id/J/JS/JSWARTZ/Cache-Cache-1.06.tar.gz ... OK
Entering /home/kent/.cpanm/work/1406089393.809714/Cache-Cache-1.06 with /bin/bash
kent@katipo2 ~/.cpanm/work/1406089393.809714/Cache-Cache-1.06 $ perl -Ilib -MCache::CacheTester -e1 
Can't use 'defined(@array)' (Maybe you should just omit the defined()?) at lib/Cache/CacheTester.pm line 562.
Compilation failed in require.
BEGIN failed--compilation aborted.

Somebody really aught to see if they can get some maintenance happening there.

@kentfredric
Copy link
Member

Hm, it does appear there is some different behaviour under the hood with WWW::Mechanize.

WWW::Mechanize returns the header:

                 "content-encoding"         => "gzip",

The others do not.

That is to say, the reason the others don't have the problem, is they're not getting the content encoded in the first place.

@kentfredric
Copy link
Member

And this looks like why:

So what we have happening here is this:

#!/usr/bin/env perl

use strict;
use warnings;

use Devel::SimpleTrace;
use HTTP::Tiny;
use MetaCPAN::Client;

my $ua = HTTP::Tiny->new(
    default_headers => {
        'Accept-Encoding' => 'gzip',
    },
);

my $mcpan = MetaCPAN::Client->new( ua => $ua );
$mcpan->author('XSAWYERX');

I don't know what to do here, because I certainly shouldn't be sending headers myself that the user didn't ask for, and I shouldn't be decoding things in case the user is expecting to see them un-decoded.

I'm kinda of the opinion that WWW::Mechanize is doing the wrong thing here, in that if WWW::Mechanize is itself adding the encoding headers, then it should also be handling decoding transparently before responding to the user, but it simply is not. Its setting the encoding and then expecting the user to manually decode it somehow, which strikes me as an odd choice.

And simply arbitrarily applying the decoding is not sufficient, I must also adjust the headers in such case so that the headers no longer indicate that the content is encoded, or code will attempt to re-decode that content and then error incorrectly.

Perhaps @karenetheridge can give some direction here being the maintainer/releaser/whatever for WWW::Mechanize, or at least garner the attention of somebody who knows the answer to this question.

@kentfredric
Copy link
Member

As a temporary workaround, you can avoid WWW::Mech's bad defaults as follows:

#!/usr/bin/env perl

use strict;
use warnings;

use Devel::SimpleTrace;
use HTTP::Tiny::Mech;
use MetaCPAN::Client;
use LWP::UserAgent;
use WWW::Mechanize;
use WWW::Mechanize::Cached;

my @ua_classes =
  ( 'WWW::Mechanize::Cached', 'LWP::UserAgent', 'WWW::Mechanize', );

foreach my $ua_class (@ua_classes) {
    print "$ua_class\n";

    my $ua = $ua_class->new( headers => { 'Accept-Encoding' => 'identity' } );
    my $wrapped_ua = HTTP::Tiny::Mech->new( mechua => $ua );

    my $mcpan = MetaCPAN::Client->new( ua => $wrapped_ua );
    $mcpan->author('XSAWYERX');
}

@karenetheridge
Copy link

On Tue, Jul 22, 2014 at 10:00:35PM -0700, Kent Fredric wrote:

Perhaps @karenetheridge can give some direction here being the maintainer/releaser/whatever for WWW::Mechanize, or at least garner the attention of somebody who knows the answer to this question.

Heh I wish. :) I don't think I've had any contact ever with the "real"
maintainers of WWWM... I just ended up with comaint to fix an install issue
(I blame mst) :D I'll try though...

@kentfredric
Copy link
Member

I'm just going to guess though, that WWW::Mech making this decision a while ago may mean its already unchangeable :(.

My opinion is what should happen is either:

a.

  • user doesn't request an encoding explicitly
  • request is either not performed with encoding, or return value is invisibly decoded in transit hiding the fact it was encoded.

b.

  • user explicitly requests an encoding
  • return value is returned undecoded in that encoding.

And as such

  • ->content should always return the content in the form the user specified, not a form determined at runtime.
  • ->decoded_content should always return data auto-decoded ( current behaviour )
  • ->raw_content should exist to return content in the internal state prior to invisible decoding effects.
  • ->headers should *NOT contain information about the content encoding if user didn't request content encoding or ->content would be auto-decoded.
  • ->raw_headers ( or similar ) should contain headers sufficient to interpret ->raw_content manually.

Anything else seems to leak who's job it is to determine what encodings are an are not acceptable, and WWW::Mech decides for you, and then you get left with the pieces WWW::Mech broke.

@oalders
Copy link
Author

oalders commented Jul 23, 2014

I should add to this that both WWW::Mechanize::GZip and WWW::Mechanize::Cached::GZip also don't explode with the code sample that I posted.

@kentfredric
Copy link
Member

WWW::Mechanize::GZip does some things right:

It

  • explicitly declares internally that it want's gzip
  • explicitly decodes the gzip before handing it over.

Thus, it encapsulates the fact it supporting gzip and hides it from the consumer.

However, it doesn't strip/fix the encoding header (https://metacpan.org/source/PEGI/WWW-Mechanize-GZip-0.12/lib/WWW/Mechanize/GZip.pm#L105)

So that if something was assuming it was HTTP::Tiny, and thus, assuming HTTP::Tiny can't support GZip natively, and thus, added manual support for it in calling code, then it would work with HTTP::Tiny.... but WWW::Mechanize::Gzip would instead return uncompressed data instead of compressed data, and the headers would indicate "this is compressed", and the calling code would likely see that and re-attempt to decompress it already decompressed data, and explode.

WWW::Mechanize::Cached::Gzip just wraps WWW::Mechanzie::Cached so any effects are just one of the two hiding it. ( Though I'll have to work out why WWW::Mechanize::Cached::GZip doesn't implicitly trigger the double-decode bug )

@kentfredric
Copy link
Member

Something is very wrong here and I don't know how to express it.

Spot the difference:

#!/usr/bin/env perl

use strict;
use warnings;

use Devel::SimpleTrace;
use HTTP::Tiny::Mech;
use MetaCPAN::Client;
use LWP::UserAgent;
use WWW::Mechanize;
use WWW::Mechanize::GZip;
use WWW::Mechanize::Cached;
use WWW::Mechanize::Cached::GZip;
use Data::Difference;

use Data::Dump qw( pp );

my @ua_classes
    = ( 'WWW::Mechanize::Cached', 'LWP::UserAgent', 'WWW::Mechanize', 'WWW::Mechanize::Cached::GZip', 'WWW::Mechanize::GZip');

my $uri = 'http://api.metacpan.org/v0/author/XSAWYERX';

my $results = {};

foreach my $ua_class ( @ua_classes ) {
    my $ua = $ua_class->new();
    my $result = $ua->get($uri);
    $results->{ $ua_class } = {
        content_encoding => $result->{'_headers'}->{'content-encoding'},
        content_length   => $result->{'_headers'}->{'content-length'},
        actual_length    => length $result->content,
        decoded_length   => length $result->decoded_content,
    };
}
print pp($results);
{
  "LWP::UserAgent"               => {
                                      actual_length    => 826,
                                      content_encoding => undef,
                                      content_length   => 826,
                                      decoded_length   => 826,
                                    },
  "WWW::Mechanize"               => {
                                      actual_length    => 410,
                                      content_encoding => "gzip",
                                      content_length   => undef,
                                      decoded_length   => 826,
                                    },
  "WWW::Mechanize::Cached"       => {
                                      actual_length    => 826,
                                      content_encoding => undef,
                                      content_length   => undef,
                                      decoded_length   => 826,
                                    },
  "WWW::Mechanize::Cached::GZip" => {
                                      actual_length    => 826,
                                      content_encoding => undef,
                                      content_length   => undef,
                                      decoded_length   => 826,
                                    },
  "WWW::Mechanize::GZip"         => {
                                      actual_length    => 826,
                                      content_encoding => "gzip",
                                      content_length   => 826,
                                      decoded_length   => undef,
                                    },
}

@kentfredric
Copy link
Member

Take special note on that last one: decoded_length => undef,

Yep. ->decoded_content doesn't return anything for WWW::Mechanize::Gzip, so simply calling that method isn't going to save us :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants