Lately I’ve been writing a Java program to post a web form and then scrape the data from the resulting web page. For some sites this is pretty easy; unfortunately, I ran up against a page where it is not. So here is some advice for those of you brave enough to attempt this:
1) Use Wireshark to capture the traffic between you and the server when the form is posted. Save the capture file to disk and open it in a text editor (I use Notepad++). The most important part is the URL-encoded string where your parameters are passed. The parameters will NOT look the way you expect, so make sure the captured values match the values your program sends.
2) Third-party libraries for doing the HTTP POST ended up being a bust. I ended up using plain java.net.HttpURLConnection and java.net.URL and building the parameters and headers myself.
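As a rough illustration of the plain java.net approach, here is a minimal sketch. The class name FormPoster and its methods are made up for this example, it assumes the server expects a standard application/x-www-form-urlencoded body, and it needs Java 10 or newer for the URLEncoder overload it uses. The site I was scraping needed extra headers on top of this, so treat it as a starting point only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not taken from any library.
public class FormPoster {

    // Build a form body such as "name=a+b&q=x%2Fy" from RAW (not yet encoded) values.
    public static String buildBody(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // POST the parameters to the given address and return the response page as text.
    public static String post(String address, Map<String, String> params) throws IOException {
        byte[] payload = buildBody(params).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // we are going to send a request body
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.setFixedLengthStreamingMode(payload.length); // sets Content-Length
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload);
        }
        try (InputStream in = conn.getInputStream()) {
            // Assumes the page is UTF-8; check the response headers for the real charset.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```

Pass the parameters in a LinkedHashMap if you want them sent in the same order you saw in the Wireshark capture.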
3) ASP.NET servers, at least the one I was dealing with, stick in two hidden form parameters called eventvalidation and viewstate (__EVENTVALIDATION and __VIEWSTATE in the page source). These are extremely long encoded strings that are dynamic, and you need to include them with the POST. To add to the harassment package, the special characters within those values needed to be percent-encoded between the GET where I got them and the POST where I used them. Observe:
viewState = viewState.replace("/", "%2F");
viewState = viewState.replace("=", "%3D");
viewState = viewState.replace("+", "%2B");
So be wary of that.
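Those three replacements are just percent-encoding, so the standard java.net.URLEncoder should do the same job for these values. The sketch below compares the two approaches. The class name and the sample string are made up for illustration (a real viewstate is far longer), and the Charset overload of URLEncoder.encode needs Java 10 or newer. Whichever way you go, encode each value exactly once, or every "%" will get encoded again as "%25".

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing manual and library encoding.
public class ViewStateEncoding {

    // The three manual replacements: covers the Base64 characters that are not URL-safe.
    public static String manualEncode(String value) {
        return value.replace("/", "%2F")
                    .replace("=", "%3D")
                    .replace("+", "%2B");
    }

    // Library equivalent: also handles spaces, ampersands and other special characters.
    public static String libraryEncode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sample = "/wEPDw+UK/abc=="; // made-up Base64-like value
        System.out.println(manualEncode(sample));
        System.out.println(libraryEncode(sample));
    }
}
```

For a value that contains only Base64 characters the two methods agree; the library call is the safer choice if the value might contain anything else.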
4) I parsed the HTML from the GET and POST responses with HtmlCleaner, which seemed to be the easiest library to work with. It can be found here:
http://htmlcleaner.sourceforge.net/
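To show the kind of extraction involved without pulling in a dependency, here is a small regex-based stand-in. To be clear, this is NOT HtmlCleaner and not what I actually used; the class and method names are made up, and it assumes the hidden input is written with double-quoted attributes and with name appearing before value. A real parser is much more forgiving of messy markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, dependency-free stand-in for a proper HTML parser.
public class HiddenFieldExtractor {

    // Return the value attribute of the <input> with the given name, or null if not found.
    public static String hiddenValue(String html, String fieldName) {
        Pattern p = Pattern.compile(
                "<input[^>]*name=\"" + Pattern.quote(fieldName) + "\"[^>]*value=\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up fragment shaped like a typical hidden form field.
        String html = "<form><input type=\"hidden\" name=\"__VIEWSTATE\" "
                + "id=\"__VIEWSTATE\" value=\"/wEPDw==\" /></form>";
        System.out.println(hiddenValue(html, "__VIEWSTATE"));
    }
}
```

The value comes back raw, exactly as it appears in the page, so it still needs to be encoded before it goes into the POST.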