WebHarvest can't find response headers

I'm working with WebHarvest to fetch data from a site that requires logging in.

It's set up like this:

Page 1 = Login page

Page 2 = Login validation page

Page 3 = Statistics page

On page 2 a cookie is set. When monitoring the opening of Page 2 with Firebug I get these headers:

Connection  Keep-Alive
Content-Type    text/html; charset=UTF-8
Date    Tue, 23 Oct 2012 18:25:12 GMT
Keep-Alive  timeout=15, max=100
Server  Apache/2.0.64 (Win32) JRun/4.0 SVN/1.3.2 DAV/2
Set-Cookie  SESSION=hej123;expires=Thu, 16-Oct-2042 18:25:12 GMT;path=/
Transfer-Encoding   chunked

When calling the same page with WebHarvest I only get these headers:

Date=Tue, 23 Oct 2012 18:31:51 GMT
Server=Apache/2.0.64 (Win32) JRun/4.0 SVN/1.3.2 DAV/2
Transfer-Encoding=chunked
Content-Type=text/html; charset=UTF-8

It seems that three headers (Set-Cookie, Connection and Keep-Alive) are not found by WebHarvest. Pages 1, 2 and 3 are dummies, so no actual validation is done. The cookie is always set on the server side for Page 2.

Here is the WebHarvest code I am currently using:

<var-def name="content2">
<html-to-xml>
<http method="post" url="http://myurl.com/page2.cfm">
    <http-param name="Login">sigge</http-param>
    <http-param name="Password">hej123</http-param>
    <http-param name="doLogin">Logga in</http-param>
    <loop item="currField">
        <list>
            <var name="ctxtNewInputs" />
        </list>
        <body>
             <script><![CDATA[
                item = (NvPair) currField.getWrappedObject();
                SetContextVar("itemName", item.name);
                SetContextVar("itemValue", item.value);
            ]]></script>
            <http-param name="${item.name}"><var name="itemValue" /></http-param>
        </body>
    </loop>
     <script><![CDATA[
        String keys="";
        for(int i=0;i<http.headers.length;i++) {
            keys+=(http.headers[i].key + "=" + http.headers[i].value +"\n---\n");
        }
        SetContextVar("myCookie", keys);
    ]]></script>
    <file action="write" path="c:/kaka.txt">
        <var name="myCookie"/>
    </file>        
</http>
</html-to-xml>
</var-def>

Edit: when checking, I noticed that the cookie is set in WebHarvest even though the HTTP header can't be found programmatically. Is it possible that some response headers are hidden from scripts?

Does anyone know a workaround for this problem?

Thank you and best regards, SiggeLund

There is 1 best solution below

To get an HTTP header value into a user-defined variable that is scoped to the whole config, do the following:

<http url="your.url.here" method="GET">
    <!-- Any settings you apply for the POST/GET call -->
</http>
<!-- The http object now holds the response, so read the header value from it here -->
<!-- At its simplest, fetching a value looks like this -->
<!-- http.headers is zero-based; adjust the index to the header you need -->
<var-def name="fifth_header_val">
    <script return="http.headers[5].value"/>
</var-def>

The above is only meant as a starting point. You can iterate over the http.headers array by index and collect the keys and values you need for your particular task.
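
A minimal sketch of that iteration, assuming the script runs after the closing </http> tag and reusing only the constructs from the question (http.headers[i].key, http.headers[i].value, SetContextVar, myCookie, and the file path). The variable name allHeaders is made up for illustration. It only picks up Set-Cookie if WebHarvest actually exposes that header in http.headers, which the question suggests may not be the case:

```xml
<http url="your.url.here" method="GET">
    <!-- Any settings you apply for the POST/GET call -->
</http>
<script><![CDATA[
    // Walk over every response header of the preceding <http> call.
    StringBuffer dump = new StringBuffer();
    String cookie = "";
    for (int i = 0; i < http.headers.length; i++) {
        String key = "" + http.headers[i].key;
        String value = "" + http.headers[i].value;
        dump.append(key).append("=").append(value).append("\n");
        // Header names are case-insensitive, so compare accordingly.
        if (key.equalsIgnoreCase("Set-Cookie")) {
            cookie = value;
        }
    }
    // myCookie stays empty if no Set-Cookie header was exposed.
    SetContextVar("myCookie", cookie);
    // allHeaders is a hypothetical name used only in this sketch.
    SetContextVar("allHeaders", dump.toString());
]]></script>
<file action="write" path="c:/kaka.txt">
    <var name="allHeaders"/>
</file>
```

Looking a header up by name rather than by a fixed index avoids depending on the order in which the server sends its headers.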