I'm working with WebHarvest to fetch data from a site that requires logging in.
It's set up like this:
Page 1 = Login page
Page 2 = Login validation page
Page 3 = Statistics page
On page 2 a cookie is set. When monitoring the opening of Page 2 with Firebug I get these headers:
Connection Keep-Alive
Content-Type text/html; charset=UTF-8
Date Tue, 23 Oct 2012 18:25:12 GMT
Keep-Alive timeout=15, max=100
Server Apache/2.0.64 (Win32) JRun/4.0 SVN/1.3.2 DAV/2
Set-Cookie SESSION=hej123;expires=Thu, 16-Oct-2042 18:25:12 GMT;path=/
Transfer-Encoding chunked
When calling the same page with WebHarvest I only get these headers:
Date=Tue, 23 Oct 2012 18:31:51 GMT
Server=Apache/2.0.64 (Win32) JRun/4.0 SVN/1.3.2 DAV/2
Transfer-Encoding=chunked
Content-Type=text/html; charset=UTF-8
It seems that three headers (Set-Cookie, Connection and Keep-Alive) are not found by WebHarvest. Pages 1, 2 and 3 are dummies, so no actual validation is done. The cookie is always set on the server side for Page 2.
Here is the WebHarvest code I am currently using:
<var-def name="content2">
    <html-to-xml>
        <http method="post" url="http://myurl.com/page2.cfm">
            <http-param name="Login">sigge</http-param>
            <http-param name="Password">hej123</http-param>
            <http-param name="doLogin">Logga in</http-param>
            <loop item="currField">
                <list>
                    <var name="ctxtNewInputs" />
                </list>
                <body>
                    <script><![CDATA[
                        item = (NvPair) currField.getWrappedObject();
                        SetContextVar("itemName", item.name);
                        SetContextVar("itemValue", item.value);
                    ]]></script>
                    <http-param name="${item.name}"><var name="itemValue" /></http-param>
                </body>
            </loop>
            <script><![CDATA[
                String keys = "";
                for (int i = 0; i < http.headers.length; i++) {
                    keys += (http.headers[i].key + "=" + http.headers[i].value + "\n---\n");
                }
                SetContextVar("myCookie", keys);
            ]]></script>
            <file action="write" path="c:/kaka.txt">
                <var name="myCookie"/>
            </file>
        </http>
    </html-to-xml>
</var-def>
Edit: when checking, I noticed that the cookie is set in WebHarvest, even though the HTTP header can't be found programmatically. Is it possible that some response headers are hidden from use?
Does anyone know a work-around for this problem?
Thank you and best regards, SiggeLund
The way to get an HTTP header value into a user-defined variable that is scoped to the whole config is the following:
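A minimal sketch of that idea, reusing only the elements and calls that appear in the question (`http.headers[i].key`, `http.headers[i].value`, `SetContextVar`). It assumes that the `http` object describes the page 2 response once the `<http>` element has finished, which is why the script sits after the closing tag rather than inside it. The variable name `sessionCookie` is hypothetical, and the URL and parameters are the placeholders from the question:

```xml
<!-- Fragment of a config body, not a complete configuration. -->
<var-def name="content2">
    <html-to-xml>
        <http method="post" url="http://myurl.com/page2.cfm">
            <http-param name="Login">sigge</http-param>
            <http-param name="Password">hej123</http-param>
            <http-param name="doLogin">Logga in</http-param>
        </http>
    </html-to-xml>
</var-def>

<!-- Placed after </http>, on the assumption that the http object
     now reflects the response to the request above. -->
<script><![CDATA[
    // "sessionCookie" is a hypothetical name chosen for this sketch.
    String cookie = "";
    for (int i = 0; i < http.headers.length; i++) {
        // Concatenate with "" so the comparison works on a plain String.
        if ("Set-Cookie".equalsIgnoreCase("" + http.headers[i].key)) {
            cookie = "" + http.headers[i].value;
        }
    }
    SetContextVar("sessionCookie", cookie);
]]></script>
```

Later elements could then read the value with `<var name="sessionCookie"/>`. If the variable comes back empty, the Set-Cookie header was not among the headers exposed for that response, and this sketch does not explain why.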
The above is just to give a clue. You can iterate over the http.headers array by index and collect the keys and values you need for your particular task.