Fetching XML from Bugzilla gives different results with curl versus browser


Right now, navigating in Chrome to https://bugs.llvm.org/show_bug.cgi?id=12187&ctype=xml shows data for the CC email addresses such as:

<cc>[email protected]</cc>
<cc>[email protected]</cc>
<cc>[email protected]</cc>
<cc>[email protected]</cc>
<cc>[email protected]</cc>

But, when I curl it from the command line,

curl 'https://bugs.llvm.org/show_bug.cgi?id=12187&ctype=xml' | grep '<cc>'

I actually get these lines instead:

<cc>dgregor</cc>
<cc>llvm-bugs</cc>
<cc>mail.sandbox.de</cc>
<cc>richard-llvm</cc>
<cc>rnk</cc>

without the trailing @domain parts. The same truncation happens with Python requests.get, so it's not specific to curl.

What on earth is going on here? And how can I work around it from curl and/or Python, so that I get the full data as displayed in the browser?


Quuxplusone On

Thanks to Jack's comment, I figured it out myself. What I see in my browser is different from what Jack sees in his browser, because of cookies. I'm logged into my Bugzilla account, so I guess I get to see full email addresses by default, whereas Jack (and curl and requests.get) are logged out, and thus Bugzilla is censoring the data by removing domains from email addresses.

To get those domains back, I need to make my GET request with the appropriate login cookies. Without cookies, as shown in the question, you get this:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://bugs.llvm.org/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.3"
          urlbase="https://bugs.llvm.org/"

          maintainer="[email protected]"
>
~~~

With the appropriate cookies (whose values I've censored here, but you can see the general shape),

#!/usr/bin/env python
import requests

# Placeholder cookie values; substitute the ones from your own logged-in session.
xml = requests.get(
    'https://bugs.llvm.org/show_bug.cgi?id=12187&ctype=xml',
    cookies={
        'Bugzilla_login': '1234',
        'Bugzilla_logincookie': 'abcdefzyxw',
    },
).text

you get this instead:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://bugs.llvm.org/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.3"
          urlbase="https://bugs.llvm.org/"

          maintainer="[email protected]"
          exporter="[email protected]"
>
~~~

Notice the extra exporter= attribute: when you provide cookies, the response shows that Bugzilla knows who you are. And then it shows you email addresses, too!
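If you want a script to check this for itself, here is a rough sketch that looks for the exporter attribute on the root element. It uses only the standard library; the function name exporter_of and the two sample documents (including the me@example.com address) are made up for illustration and are not real Bugzilla output.

```python
import xml.etree.ElementTree as ET


def exporter_of(xml_text):
    """Return the exporter attribute of the root <bugzilla> element, or None."""
    # Encode to bytes so the parser handles the encoding="UTF-8" declaration itself.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return root.get("exporter")


# Hypothetical, trimmed-down responses; the address is a placeholder.
LOGGED_OUT = """<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://bugs.llvm.org/page.cgi?id=bugzilla.dtd">
<bugzilla version="5.0.3" urlbase="https://bugs.llvm.org/">
</bugzilla>"""

LOGGED_IN = """<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://bugs.llvm.org/page.cgi?id=bugzilla.dtd">
<bugzilla version="5.0.3" urlbase="https://bugs.llvm.org/"
          exporter="me@example.com">
</bugzilla>"""

print(exporter_of(LOGGED_OUT))  # None
print(exporter_of(LOGGED_IN))   # me@example.com
```

A None result would suggest the cookies were missing or not accepted, at least on an instance that behaves like the one described here.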

Basically, Bugzilla is trying to avoid serving any text that resembles a user's email address on any webpage that might be visited by a spider, because the spider might be harvesting email addresses for spammy purposes. But if you're providing auth cookies, then you're not a random spider and it's okay for Bugzilla to serve you full email addresses.
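As a sanity check on the CC data itself, something like the following could replace the grep from the question. It is only a sketch: cc_entries and looks_censored are names I made up, the sample XML is a hypothetical fragment with placeholder addresses, and the "no @ means censored" test is my own heuristic based on the output shown above, not anything Bugzilla documents.

```python
import xml.etree.ElementTree as ET


def cc_entries(xml_text):
    """Collect the text of every <cc> element, wherever it is nested."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [(cc.text or "").strip() for cc in root.iter("cc")]


def looks_censored(entries):
    """Heuristic: any entry without an '@' has probably lost its domain."""
    return any("@" not in entry for entry in entries)


# Hypothetical fragments shaped like the logged-out and logged-in responses.
CENSORED = "<bugzilla><bug><cc>dgregor</cc><cc>rnk</cc></bug></bugzilla>"
FULL = (
    "<bugzilla><bug>"
    "<cc>alice@example.com</cc><cc>bob@example.org</cc>"
    "</bug></bugzilla>"
)

print(cc_entries(CENSORED))                # ['dgregor', 'rnk']
print(looks_censored(cc_entries(CENSORED)))  # True
print(looks_censored(cc_entries(FULL)))      # False
```

Feeding it the .text of the response from the requests.get call would tell you whether the cookies actually took effect.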

This is the complete answer AFAIK, but please feel free to add if I've missed something!