Added document on NTLM support; ported HTTP client programming primer from WIKI
git-svn-id: https://svn.apache.org/repos/asf/httpcomponents/httpclient/trunk@657143 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
parent
4022a78140
commit
14d0e406a3
|
@ -62,7 +62,7 @@ Features
|
||||||
|
|
||||||
* Tunneled HTTPS connections through HTTP proxies, via the CONNECT method.
|
* Tunneled HTTPS connections through HTTP proxies, via the CONNECT method.
|
||||||
|
|
||||||
* Basic, Digest authentication schemes. Please note NTLM is currently not supported.
|
* Basic, Digest authentication schemes. Please note NTLM is supported only partially.
|
||||||
|
|
||||||
* Plug-in mechanism for custom authentication schemes.
|
* Plug-in mechanism for custom authentication schemes.
|
||||||
|
|
||||||
|
|
|
@ -0,0 +1,183 @@
|
||||||
|
~~ $HeadURL$
|
||||||
|
~~ $Revision$
|
||||||
|
~~ $Date$
|
||||||
|
~~
|
||||||
|
~~ ====================================================================
|
||||||
|
~~ Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
~~ or more contributor license agreements. See the NOTICE file
|
||||||
|
~~ distributed with this work for additional information
|
||||||
|
~~ regarding copyright ownership. The ASF licenses this file
|
||||||
|
~~ to you under the Apache License, Version 2.0 (the
|
||||||
|
~~ "License"); you may not use this file except in compliance
|
||||||
|
~~ with the License. You may obtain a copy of the License at
|
||||||
|
~~
|
||||||
|
~~ http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
~~
|
||||||
|
~~ Unless required by applicable law or agreed to in writing,
|
||||||
|
~~ software distributed under the License is distributed on an
|
||||||
|
~~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
~~ KIND, either express or implied. See the License for the
|
||||||
|
~~ specific language governing permissions and limitations
|
||||||
|
~~ under the License.
|
||||||
|
~~ ====================================================================
|
||||||
|
~~
|
||||||
|
~~ This software consists of voluntary contributions made by many
|
||||||
|
~~ individuals on behalf of the Apache Software Foundation. For more
|
||||||
|
~~ information on the Apache Software Foundation, please see
|
||||||
|
~~ <http://www.apache.org/>.
|
||||||
|
|
||||||
|
----------
|
||||||
|
NTLM support in HttpClient
|
||||||
|
----------
|
||||||
|
----------
|
||||||
|
----------
|
||||||
|
|
||||||
|
NTLM support in HttpClient
|
||||||
|
|
||||||
|
Currently HttpClient 4.0 does not provide support for the NTLM authentication scheme
|
||||||
|
out of the box and probably never will. The reasons for that are legal rather than
|
||||||
|
technical.
|
||||||
|
|
||||||
|
* Background
|
||||||
|
|
||||||
|
NTLM is a proprietary authentication scheme developed by Microsoft and optimized for
|
||||||
|
Windows operating system.
|
||||||
|
|
||||||
|
Until year 2008 there was no official, publicly available, complete documentation of
|
||||||
|
the protocol. {{{http://davenport.sourceforge.net/ntlm.html}Unofficial}} 3rd party
|
||||||
|
protocol descriptions existed as a result of reverse-engineering efforts. It was not
|
||||||
|
really known whether the protocol based on the reverse-engineering were complete or
|
||||||
|
even correct.
|
||||||
|
|
||||||
|
Microsoft published {{{http://download.microsoft.com/download/a/e/6/ae6e4142-aa58-45c6-8dcf-a657e5900cd3/%5BMS-NLMP%5D.pdf}MS-NLMP}}
|
||||||
|
and {{{http://download.microsoft.com/download/a/e/6/ae6e4142-aa58-45c6-8dcf-a657e5900cd3/%5BMS-NTHT%5D.pdf}MS-NTHT}}
|
||||||
|
specifications in February 2008 as a part of its
|
||||||
|
{{{http://www.microsoft.com/interop/principles/default.mspx}Interoperability
|
||||||
|
Principles initiative}}. Unfortunately, it is still not entirely clear whether NTLM
|
||||||
|
encryption algorithms are covered by any patents held by Microsoft, which would make
|
||||||
|
commercial users of open-source NTLM implementations liable for the use of Microsoft
|
||||||
|
intellectual property.
|
||||||
|
|
||||||
|
* Enabling NTLM support in HttpClient 4.x
|
||||||
|
|
||||||
|
The good news is HttpClient is fully NTLM capable right out of the box.
|
||||||
|
HttpClient ships with the NTLM authentication scheme, which, if configured
|
||||||
|
to use an external NTLM engine, can handle NTLM challenges and authenticate
|
||||||
|
against NTLM servers.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
public interface NTLMEngine {
|
||||||
|
|
||||||
|
String generateType1Msg(
|
||||||
|
String domain,
|
||||||
|
String workstation) throws NTLMEngineException;
|
||||||
|
|
||||||
|
String generateType3Msg(
|
||||||
|
String username,
|
||||||
|
String password,
|
||||||
|
String domain,
|
||||||
|
String workstation,
|
||||||
|
String challenge) throws NTLMEngineException;
|
||||||
|
|
||||||
|
}
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
* Using Samba JCIFS as an NTLM engine
|
||||||
|
|
||||||
|
Follow these instructions to build an NTLMEngine implementation using JCIFS library
|
||||||
|
|
||||||
|
<<!!!!DISCLAIMER !!!! HttpComponents project DOES _NOT_ SUPPORT the code provided below.
|
||||||
|
Use it as is at your own discretion>>.
|
||||||
|
|
||||||
|
* Download the latest release of the JCIFS library from the
|
||||||
|
{{{http://jcifs.samba.org/}Samba}} web site
|
||||||
|
|
||||||
|
* Implement NTLMEngine interface
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
import jcifs.ntlmssp.Type1Message;
|
||||||
|
import jcifs.ntlmssp.Type2Message;
|
||||||
|
import jcifs.ntlmssp.Type3Message;
|
||||||
|
import jcifs.util.Base64;
|
||||||
|
|
||||||
|
import org.apache.http.impl.auth.NTLMEngine;
|
||||||
|
import org.apache.http.impl.auth.NTLMEngineException;
|
||||||
|
|
||||||
|
public class JCIFSEngine implements NTLMEngine {
|
||||||
|
|
||||||
|
public String generateType1Msg(
|
||||||
|
String domain,
|
||||||
|
String workstation) throws NTLMEngineException {
|
||||||
|
|
||||||
|
Type1Message t1m = new Type1Message(
|
||||||
|
Type1Message.getDefaultFlags(),
|
||||||
|
domain,
|
||||||
|
workstation);
|
||||||
|
return Base64.encode(t1m.toByteArray());
|
||||||
|
}
|
||||||
|
|
||||||
|
public String generateType3Msg(
|
||||||
|
String username,
|
||||||
|
String password,
|
||||||
|
String domain,
|
||||||
|
String workstation,
|
||||||
|
String challenge) throws NTLMEngineException {
|
||||||
|
Type2Message t2m;
|
||||||
|
try {
|
||||||
|
t2m = new Type2Message(Base64.decode(challenge));
|
||||||
|
} catch (IOException ex) {
|
||||||
|
throw new NTLMEngineException("Invalid Type2 message", ex);
|
||||||
|
}
|
||||||
|
Type3Message t3m = new Type3Message(
|
||||||
|
t2m,
|
||||||
|
password,
|
||||||
|
domain,
|
||||||
|
username,
|
||||||
|
workstation);
|
||||||
|
return Base64.encode(t3m.toByteArray());
|
||||||
|
}
|
||||||
|
|
||||||
|
}
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
* Implement AuthSchemeFactory interface
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
import org.apache.http.auth.AuthScheme;
|
||||||
|
import org.apache.http.auth.AuthSchemeFactory;
|
||||||
|
import org.apache.http.impl.auth.NTLMScheme;
|
||||||
|
import org.apache.http.params.HttpParams;
|
||||||
|
|
||||||
|
public class NTLMSchemeFactory implements AuthSchemeFactory {
|
||||||
|
|
||||||
|
public AuthScheme newInstance(final HttpParams params) {
|
||||||
|
return new NTLMScheme(new JCIFSEngine());
|
||||||
|
}
|
||||||
|
|
||||||
|
}
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
* Register NTLMSchemeFactory with the HttpClient instance you want to NTLM
|
||||||
|
enable.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
httpclient.getAuthSchemes().register("ntlm", new NTLMSchemeFactory());
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
* Set NTCredentials for the web server you are going to access.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
httpclient.getCredentialsProvider().setCredentials(
|
||||||
|
new AuthScope("myserver", -1),
|
||||||
|
new NTCredentials("username", "password", "MYSERVER", "MYDOMAIN"));
|
||||||
|
-----------------------------------------------------------
|
||||||
|
|
||||||
|
* You are done.
|
||||||
|
|
||||||
|
|
||||||
|
* Why this code is not distributed with HttpClient
|
||||||
|
|
||||||
|
JCIFS is licensed under the Lesser General Public License (LGPL). This license
|
||||||
|
is not compatible with the Apache Licenses under which all Apache Software is
|
||||||
|
released. Lawyers of the Apache Software Foundation are currently investigating
|
||||||
|
under which conditions Apache software is allowed to make use of LGPL software.
|
|
@ -0,0 +1,670 @@
|
||||||
|
~~ $HeadURL$
|
||||||
|
~~ $Revision$
|
||||||
|
~~ $Date$
|
||||||
|
~~
|
||||||
|
~~ ====================================================================
|
||||||
|
~~ Licensed to the Apache Software Foundation (ASF) under one
|
||||||
|
~~ or more contributor license agreements. See the NOTICE file
|
||||||
|
~~ distributed with this work for additional information
|
||||||
|
~~ regarding copyright ownership. The ASF licenses this file
|
||||||
|
~~ to you under the Apache License, Version 2.0 (the
|
||||||
|
~~ "License"); you may not use this file except in compliance
|
||||||
|
~~ with the License. You may obtain a copy of the License at
|
||||||
|
~~
|
||||||
|
~~ http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
~~
|
||||||
|
~~ Unless required by applicable law or agreed to in writing,
|
||||||
|
~~ software distributed under the License is distributed on an
|
||||||
|
~~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
~~ KIND, either express or implied. See the License for the
|
||||||
|
~~ specific language governing permissions and limitations
|
||||||
|
~~ under the License.
|
||||||
|
~~ ====================================================================
|
||||||
|
~~
|
||||||
|
~~ This software consists of voluntary contributions made by many
|
||||||
|
~~ individuals on behalf of the Apache Software Foundation. For more
|
||||||
|
~~ information on the Apache Software Foundation, please see
|
||||||
|
~~ <http://www.apache.org/>.
|
||||||
|
|
||||||
|
----------
|
||||||
|
Client HTTP Programming Primer
|
||||||
|
----------
|
||||||
|
----------
|
||||||
|
----------
|
||||||
|
|
||||||
|
Client HTTP Programming Primer
|
||||||
|
|
||||||
|
* About
|
||||||
|
|
||||||
|
This document is intended for people who suddenly have to or want to implement
|
||||||
|
an application that automates something usually done with a browser,
|
||||||
|
but are missing the background to understand what they actually need to do.
|
||||||
|
It provides guidance on the steps required to implement a program that
|
||||||
|
interacts with a web site which is designed to be used with a browser.
|
||||||
|
It does not save you from eventually learning the background of what
|
||||||
|
you are doing, but it should help you to get started quickly and learn
|
||||||
|
the details later.
|
||||||
|
|
||||||
|
This document has evolved from discussions on the HttpClient mailing lists.
|
||||||
|
Although it refers to HttpClient, the concepts described here apply equally
|
||||||
|
to HttpComponents or SUN's {{{http://java.sun.com/j2se/1.4.2/docs/api/java/net/HttpURLConnection.html}HttpURLConnection}}
|
||||||
|
or any other HTTP communication library for any programming language. So you
|
||||||
|
might find it useful even if you're not using Java and HttpClient.
|
||||||
|
|
||||||
|
The existence of this document does not imply that the HttpClient community
|
||||||
|
feels responsible for teaching you how to program a client HTTP application.
|
||||||
|
It is merely a way for us to reduce the noise on the mailing list without
|
||||||
|
just leaving the newbies out in the cold.
|
||||||
|
|
||||||
|
* Scenario
|
||||||
|
|
||||||
|
Let's assume that you have some kind of repetitive, web-based task that
|
||||||
|
you want to automate. Something like:
|
||||||
|
|
||||||
|
* goto page http://xxx.yyy.zzz/login.html
|
||||||
|
|
||||||
|
* enter username and password in a web form and hit the "login" button
|
||||||
|
|
||||||
|
* navigate to a specific page
|
||||||
|
|
||||||
|
* check the number/headline/whatever shown on that page
|
||||||
|
|
||||||
|
[]
|
||||||
|
|
||||||
|
At this time, we don't have a specific example which could be developed
|
||||||
|
into a sample application. So this document is all bla-bla, and you will
|
||||||
|
have to work out the details - all the details - yourself. Such is life.
|
||||||
|
|
||||||
|
* Caveat
|
||||||
|
|
||||||
|
This scenario describes a hobbyist usage of HTTP, in other words:
|
||||||
|
<<a bad practice>>. Web sites are designed for user interaction, not
|
||||||
|
as an application programming interface (API). The interface of a
|
||||||
|
web site is the user interface displayed by a browser. The HTTP
|
||||||
|
communication between the browser and the server is an internal API,
|
||||||
|
subject to change without notice.
|
||||||
|
|
||||||
|
A web site can be redesigned at any point in time. The server then
|
||||||
|
sends different documents and a browser will display the new content.
|
||||||
|
The user easily adjusts to click the appropriate links, and the browser
|
||||||
|
communicates via HTTP as specified by the new documents from the server.
|
||||||
|
Your application that only mimicks a browser will simply break.
|
||||||
|
|
||||||
|
Nevertheless, implementing this scenario will help you to get
|
||||||
|
familiar with HTTP communication. It is also "good enough" for
|
||||||
|
hobbyists applications, for example if you want to download the
|
||||||
|
latest installment of your favorite daily webcomic to install
|
||||||
|
it as the screen background. There is no big damage if such an
|
||||||
|
application breaks.
|
||||||
|
|
||||||
|
If you want to implement a solid application, you should use only
|
||||||
|
published APIs. For example, to check for new mail on your webmail
|
||||||
|
account, you should ask the webmail provider for POP or IMAP access.
|
||||||
|
These are standardized protocols supported my most EMail client applications.
|
||||||
|
If you want to have a newsticker, look for RSS feeds from the provider and
|
||||||
|
applications that display them.
|
||||||
|
|
||||||
|
As another example, if you want to perform a web search, there are
|
||||||
|
search companies that provide an API for using their search engines.
|
||||||
|
Unlike the examples before, such APIs are proprietary. You will still
|
||||||
|
have to implement an application, but then you are using a published API
|
||||||
|
that the provider will not change without notice.
|
||||||
|
|
||||||
|
|
||||||
|
* Not a Browser
|
||||||
|
|
||||||
|
HttpClient is not a browser. Here's the difference.
|
||||||
|
|
||||||
|
<<Browser>>
|
||||||
|
|
||||||
|
[images/browser.png] Browser
|
||||||
|
|
||||||
|
The figure shows some of the components you will find in a browser.
|
||||||
|
To the left, there is the user interface. The browser needs a rendering
|
||||||
|
engine to display pages, and to interpret user input such as mouse clicks
|
||||||
|
somewhere on the displayed page. There is a layout engine which computes
|
||||||
|
how an HTML page should be displayed, including cascading style sheets
|
||||||
|
and images. A JavaScript interpreter runs JavaScript code embedded in
|
||||||
|
or referenced from HTML pages. Events from the user interface are passed
|
||||||
|
to the JavaScript interpreter for processing.
|
||||||
|
On the top, there are interfaces for plugins that can handle Applets,
|
||||||
|
embedded media objects like PDF files, Quicktime movies and Flash animations,
|
||||||
|
or ActiveX controls that can do anything.
|
||||||
|
|
||||||
|
In the center of the figure you can find internal components. Browsers
|
||||||
|
have a cache of recently accessed documents and image files. They need
|
||||||
|
to remember cookies and passwords entered by the user. Such information
|
||||||
|
can be kept in memory or stored persistently in the file system at the
|
||||||
|
bottom of the figure, to be available again when the browser is restarted.
|
||||||
|
Certificates for secure communication are almost always stored persistently.
|
||||||
|
To the right of the figure is the network. Browsers support many protocols
|
||||||
|
on different levels of abstraction. There are application protocols
|
||||||
|
such as FTP and HTTP to retrieve documents from servers, and transport
|
||||||
|
layer protocols such as TLS/SSL and Socks to establish connections for
|
||||||
|
the application protocols.
|
||||||
|
|
||||||
|
One characteristic of browsers that is not shown in the figure is tolerance
|
||||||
|
for bad input. There needs to be tolerance for invalid user input to make
|
||||||
|
the browser user friendly. There also needs to be tolerance for malformed
|
||||||
|
documents retrieved from servers, and for flaws in server behavior when
|
||||||
|
executing protocols, to make as many websites as possible accessible to
|
||||||
|
the user.
|
||||||
|
|
||||||
|
<<HTTP Client>>
|
||||||
|
|
||||||
|
[images/httpclient.png] HTTP Client
|
||||||
|
|
||||||
|
The figure shows some of the components you will find in a browser,
|
||||||
|
and highlights the scope of HttpClient. The primary responsibility
|
||||||
|
of HttpClient is the HTTP protocol, executed directly or through an
|
||||||
|
HTTP proxy. It provides interfaces and default implementations for
|
||||||
|
cookie and password management, but not for persisting such data.
|
||||||
|
User interfacing, HTML parsing, plugins or non-HTTP application level
|
||||||
|
protocols are not in the scope of HttpClient. It does provide interfaces
|
||||||
|
to plug in transport layer protocols, but it does not implement such
|
||||||
|
protocols.
|
||||||
|
|
||||||
|
All the rest of a browser's functionality you require needs to be
|
||||||
|
provided by your application. HttpClient executes HTTP requests, but it
|
||||||
|
will not and can not assemble them. Since HttpClient does not interface
|
||||||
|
with the user, nor interpret content such as HTML files, there is
|
||||||
|
little or no tolerance for bad data passed to the API. There is some
|
||||||
|
tolerance for flaws in server behavior, but there are limits to the
|
||||||
|
deviations HttpClient can handle.
|
||||||
|
|
||||||
|
* Terminology
|
||||||
|
|
||||||
|
This section introduces some important terms you have to know to
|
||||||
|
understand the rest of this document.
|
||||||
|
|
||||||
|
<<<HTTP Message>>>
|
||||||
|
|
||||||
|
consists of a header section and an optional entity. There are two kinds
|
||||||
|
of messages, requests and responses. They differ in the format of the
|
||||||
|
first line, but both can have header fields and an optional entity.
|
||||||
|
|
||||||
|
<<<HTTP Request>>>
|
||||||
|
|
||||||
|
is sent from a client to a server. The first line includes the URI for
|
||||||
|
which the request is sent, and a method that the server should execute
|
||||||
|
for the client.
|
||||||
|
|
||||||
|
<<<HTTP Response>>>
|
||||||
|
|
||||||
|
is sent from a server to a client in response to a request. The first
|
||||||
|
line includes a status code that tells about success or failure of
|
||||||
|
the request. HTTP defines a set of status codes, like 200 for success
|
||||||
|
and 404 for not found. Other protocols based on HTTP can define
|
||||||
|
additional status codes.
|
||||||
|
|
||||||
|
<<<Method>>>
|
||||||
|
|
||||||
|
is an operation requested from the server. HTTP defines a set of
|
||||||
|
operations, the most frequent being GET and POST. Other protocols
|
||||||
|
based on HTTP can define additional methods.
|
||||||
|
|
||||||
|
<<<Header Fields>>>
|
||||||
|
|
||||||
|
are name-value pairs, where both name and value are text. The name of
|
||||||
|
a header field is not case sensitive. Multiple values can be assigned
|
||||||
|
to the same name. RFC 2616 defines a wide range
|
||||||
|
of header fields for handling various aspects of the HTTP protocol.
|
||||||
|
Other specifications, like RFC 2617 and RFC 2965, define additional
|
||||||
|
headers. Some of the defined headers are for general use, others are
|
||||||
|
meant for exclusive use with either requests or responses, still others
|
||||||
|
are meant for use only with an entity.
|
||||||
|
|
||||||
|
<<<Entity>>>
|
||||||
|
|
||||||
|
is data sent with an HTTP message. For example, a response can contain
|
||||||
|
the page or image you are downloading as an entity, or a request can
|
||||||
|
include the parameters that you entered into a web form.
|
||||||
|
The entity of an HTTP message can have an arbitrary data format, which
|
||||||
|
is usually specified as a MIME type in a header field.
|
||||||
|
|
||||||
|
<<<Session>>>
|
||||||
|
|
||||||
|
is a series of requests from a single source to a server. The server
|
||||||
|
can keep session data, and needs to recognize the session to which
|
||||||
|
each incoming request belongs. For example, if you execute a web search,
|
||||||
|
the server will only return one page of search results. But it keeps
|
||||||
|
track of the other results and makes them available when you click on
|
||||||
|
the link to the "next" page. The server needs to know from the request
|
||||||
|
that it is you and your session for which more results are requested,
|
||||||
|
and not me and my session. That's because I searched for something else.
|
||||||
|
|
||||||
|
<<<Cookies>>>
|
||||||
|
|
||||||
|
are the preferred way for servers to track sessions. The server supplies
|
||||||
|
a piece of data, called a cookie, in response to a request. The server
|
||||||
|
expects the client to send that piece of data in a header field with each
|
||||||
|
following request of the same session.
|
||||||
|
The cookie is different for each session, so the server can identify to
|
||||||
|
which session a request belongs by looking at the cookie. If the cookie
|
||||||
|
is missing from a request, the server will not respond as expected.
|
||||||
|
|
||||||
|
* Step by Step
|
||||||
|
|
||||||
|
** GET the Login Page
|
||||||
|
|
||||||
|
Create and execute a GET request for the login page.
|
||||||
|
Just use the link you would type into the browser as the URL.
|
||||||
|
This is what a browser does when you enter a URL in the address bar
|
||||||
|
or when you click on a link that points to another web page.
|
||||||
|
|
||||||
|
Inspect the response from the server:
|
||||||
|
|
||||||
|
* do you get the page you expected?
|
||||||
|
|
||||||
|
[]
|
||||||
|
|
||||||
|
It should be sent as the entity of the response to your request.
|
||||||
|
The entity is also referred to as the response body.
|
||||||
|
|
||||||
|
* do you get a session cookie?
|
||||||
|
|
||||||
|
[]
|
||||||
|
|
||||||
|
Cookies are sent in a header field named Set-Cookie or Set-Cookie2.
|
||||||
|
It is possible that you don't get a session cookie until you log in.
|
||||||
|
If there is no session cookie in the response, you'll have to do perform
|
||||||
|
step 2 later, after you reach the point where the cookie is set.
|
||||||
|
|
||||||
|
If you do not get the page you expect, check the URL you are requesting.
|
||||||
|
If it is correct, the server may use a browser detection. You will have
|
||||||
|
to set the header field User-Agent to a value used by a popular browser
|
||||||
|
to pretend that the request is coming from that browser.
|
||||||
|
|
||||||
|
If you can't get the login page, get the home page instead now.
|
||||||
|
Get the login page in the next step, when you establish the session.
|
||||||
|
|
||||||
|
** Establish the Session
|
||||||
|
|
||||||
|
Create and execute another GET request for a page.
|
||||||
|
You can simply request the login page again, or some other page
|
||||||
|
of which you know the URL. Do NOT try to get a page which would
|
||||||
|
be returned in response to submitting a web form. Use something
|
||||||
|
you can reach simply by clicking on a link in the browser. Something
|
||||||
|
where you can see the URL in the browser status line while the
|
||||||
|
mouse pointer is hovering over the link.
|
||||||
|
|
||||||
|
This step is important when developing the application. Once you know
|
||||||
|
that your application does establish the session correctly, you may
|
||||||
|
be able to remove it. Only if you couldn't get the login page directly
|
||||||
|
and had to get the home page first, you know you have to leave it in.
|
||||||
|
|
||||||
|
Inspect the request being sent to the server.
|
||||||
|
|
||||||
|
* is the session cookie sent with the request?
|
||||||
|
|
||||||
|
[]
|
||||||
|
|
||||||
|
You can see what is sent to the server by enabling the wire log
|
||||||
|
for HttpClient. You only need to see the request headers, not the body.
|
||||||
|
The session cookie should be sent in a header field called Cookie.
|
||||||
|
There may be several of those, and other cookies might be sent as well.
|
||||||
|
|
||||||
|
Inspect the response from the server:
|
||||||
|
|
||||||
|
* do you get another session cookie?
|
||||||
|
|
||||||
|
[]
|
||||||
|
|
||||||
|
You should not get another session cookie. If you get the same session
|
||||||
|
cookie as before, the server behaves a little strange but that should
|
||||||
|
not be a problem. If you get a new session cookie, then the server did
|
||||||
|
not recognize the session for the request. Usually, this happens if the
|
||||||
|
request did not contain the session cookie. But servers might use other
|
||||||
|
means to track sessions, or to detect session hijacking.
|
||||||
|
|
||||||
|
If the session cookie is not sent in the request, one of two things
|
||||||
|
has gone wrong. Either the cookie was not detected in the previous
|
||||||
|
response, or the cookie was not selected for being sent with the new
|
||||||
|
request.
|
||||||
|
|
||||||
|
HttpClient automatically parses cookies sent in responses and puts them
|
||||||
|
into a cookie store. HttpClient uses a configurable cookie policy
|
||||||
|
to decide whether a cookie being sent from a server is correct.
|
||||||
|
The default policy complies strictly with RFC 2109, but many servers
|
||||||
|
do not. Play around with the cookie policies until the cookie is
|
||||||
|
accepted and put into the cookie store.
|
||||||
|
|
||||||
|
If the cookie is accepted from the previous response but still not
|
||||||
|
sent with the new request, make sure that HttpClient uses the same
|
||||||
|
cookie store object. Unless you explicitly manage cookie store
|
||||||
|
objects (not recommended for newbies!), this will be the case if you
|
||||||
|
use the same HttpClient object to execute both requests.
|
||||||
|
|
||||||
|
If the cookie is still not sent with the request, make sure that the
|
||||||
|
URL you are requesting is in the scope for the cookie. Cookies are
|
||||||
|
only sent to the domain and path specified in the cookie scope.
|
||||||
|
A cookie for host "jakarta.apache.org" will not be sent to host
|
||||||
|
"tomcat.apache.org". A cookie for domain ".apache.org" will be sent
|
||||||
|
to both. A cookie for host "apache.org", without the leading dot,
|
||||||
|
will not be sent to "jakarta.apache.org". The latter case can be
|
||||||
|
resolved by using a different cookie spec that adds the leading dot.
|
||||||
|
In the other cases, use a URL that in the cookie scope to establish
|
||||||
|
the session.
|
||||||
|
|
||||||
|
If the session cookie is sent with the request, but a new session cookie
|
||||||
|
is set in the response anyway, check whether there are cookies other
|
||||||
|
than the session cookie in the request. Some servers are incapable of
|
||||||
|
detecting multiple cookies sent in individual header fields. HttpClient
|
||||||
|
can be advised to put all cookies into a single header field.
|
||||||
|
|
||||||
|
If that doesn't help, you are in trouble. The server may use additional
|
||||||
|
means to track the session, for example the header field named Referer.
|
||||||
|
Set that field to the URL of the previous request.
|
||||||
|
({{{http://mail-archives.apache.org/mod_mbox/jakarta-httpclient-user/200602.mbox/%3c19b.44e04b45.31166eaa@aol.com%3e}see this mail}})
|
||||||
|
|
||||||
|
If that doesn't help either, you will have to compare the request from
|
||||||
|
your application to a corresponding one generated by a browser. The
|
||||||
|
instructions in step 5 for POST requests apply for GET requests as well.
|
||||||
|
It's even simpler with GET, since you don't have an entity.
|
||||||
|
|
||||||
|
** Analyze the Form
|
||||||
|
|
||||||
|
Now it is time to analyze the form defined in the HTML markup of the page.
|
||||||
|
A form in HTML is a set of name-value-pairs called parameters, where some
|
||||||
|
of the values can be entered in the browser. By analyzing the HTML markup,
|
||||||
|
you can learn which parameters you have to define and how to send them
|
||||||
|
to the server.
|
||||||
|
|
||||||
|
Look for the <form> tag in the page source. There may be several forms in
|
||||||
|
the page, but they can not be nested. Locate the form you want to submit.
|
||||||
|
Locate the matching </form> tag. Everything in between the two may be
|
||||||
|
relevant. Let's start with the attributes of the <form> tag:
|
||||||
|
|
||||||
|
<<<method=>>>
|
||||||
|
|
||||||
|
specifies the method used for submitting the form. If it is GET or
|
||||||
|
not specified at all, then you need to create a GET request. The parameters
|
||||||
|
will be added as a query string to the URL. If the method is POST, you
|
||||||
|
need to create a POST request. The parameters will be put in the entity
|
||||||
|
of the request, also referred to as the request body.
|
||||||
|
How to do that is discussed in step 5.
|
||||||
|
|
||||||
|
<<<action=>>>
|
||||||
|
|
||||||
|
specifies the URL to which the request has to be sent. Do not try to
|
||||||
|
get this URL from the address bar of your browser! A browser will
|
||||||
|
automatically follow redirects and only displays the final URL, which
|
||||||
|
can be different from the URL in this attribute.
|
||||||
|
It is possible that the URL includes a query string that specifies
|
||||||
|
some parameters. If so, keep that in mind.
|
||||||
|
|
||||||
|
<<<enctype=>>>
|
||||||
|
|
||||||
|
specifies the MIME type for the entity of the request generated by the
|
||||||
|
form. The two common cases are url-encoded (default) and multipart-mime.
|
||||||
|
Note that these terms are just informally used here, the exact values
|
||||||
|
that need to be written in an HTML document are specified elsewhere.
|
||||||
|
This attribute is only used for the POST method. If the method is GET,
|
||||||
|
the parameters will always be url-encoded, but not in an entity.
|
||||||
|
|
||||||
|
<<<accept-charset=>>>
|
||||||
|
|
||||||
|
specifies the character set that the browser should allow for user input.
|
||||||
|
It will not be discussed here, but you will have to consider this value
|
||||||
|
if you experience charset related problems.
|
||||||
|
|
||||||
|
Except for optional query parameters in the action attribute, the parameters
|
||||||
|
of a form are specified by HTML tags between <form> and </form>.
|
||||||
|
The following is a list of tags that can be used to define parameters.
|
||||||
|
Except where stated otherwise, they have a name attribute which specifies
|
||||||
|
the name of the parameter. The value of the parameter usually depends on
|
||||||
|
user input.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
<input type="text" name="...">
|
||||||
|
<input type="password" name="...">
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
specify single-line input fields. Using the return key in one of these
|
||||||
|
fields will submit the form, so the value really is a single line of
|
||||||
|
input from the user.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
<input type="text" readonly name="..." value="...">
|
||||||
|
<input type="hidden" name="..." value="...">
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
specify a parameter that can not be changed by the user.
|
||||||
|
The value of the parameter is given by the value attribute.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
<input type="radio" name="..." value="...">
|
||||||
|
<input type="checkbox" name="..." value="...">
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
specify a parameter that can be included or omitted. There usually is
|
||||||
|
more than one tag with the same name. For radio buttons, only one can
|
||||||
|
be selected and the value of the parameter is the value of the selected
|
||||||
|
radio button. For checkboxes, more than one can be selected. There will
|
||||||
|
be one name-value-pair for each selected checkbox, with the same name
|
||||||
|
for all of them.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
<input type="submit" name="..." value="...">
|
||||||
|
<button type="submit" name="..." value="...">
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
specify a button to submit the form. The parameter will only be added
|
||||||
|
to the form if that button is used to submit. If another button is used,
|
||||||
|
or the form is submitted by pressing the return key in a text input field,
|
||||||
|
the parameter is not part of the submitted form data. If the name attribute
|
||||||
|
is missing, no parameter is added to the form data for that button.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
<textarea name="...">
|
||||||
|
<textarea value="..." readonly>
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
specify a multi-line input field. In the readonly case, the value of
|
||||||
|
the parameter is the text between the <textarea> and </textarea> tags.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
<select name="..." multiple>}}}
|
||||||
|
<option value="...">...</option>}}}
|
||||||
|
<option value="...">...</option>}}}
|
||||||
|
...
|
||||||
|
</select>
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
specify a selection list or drop-down menu. If the multiple attribute is
|
||||||
|
not present, only one option can be selected. There will be one
|
||||||
|
name-value-pair for each selected option, with the same name for all of them.
|
||||||
|
If there is no value attribute, the value for that option is
|
||||||
|
the text between <option> and </option>.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
<input type="image" name="...">
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
specifies an image that can be clicked to submit the form. If that image
|
||||||
|
is clicked to submit the form, two parameters are added to the form data.
|
||||||
|
The name attribute is suffixed with ".x" and ".y", the values for the
|
||||||
|
parameters are the relative coordinates of the mouse pointer within the
|
||||||
|
image at the time of the click, in pixel. If the name attribute is missing,
|
||||||
|
no parameters will be added to the form data.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
<input type="file" name="...">
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
specifies a file selection box. The user can select a file that should
|
||||||
|
be sent as part of the form data. This is only possible if the encoding
|
||||||
|
is multipart-mime. Unlike other parameters, the file is not mapped to a
|
||||||
|
simple name-value-pair. File upload is not a topic for beginners.
|
||||||
|
|
||||||
|
These tags are used to define parameters in static HTML. With dynamic HTML,
|
||||||
|
in particular JavaScript, the parameter values can be changed before the
|
||||||
|
form is submitted. If that is the case, you are in trouble. Learn JavaScript,
|
||||||
|
analyze the code that is executed, and modify your application to match
|
||||||
|
that behavior.
|
||||||
|
|
||||||
|
|
||||||
|
** Analyze the Form, Again
|
||||||
|
|
||||||
|
After you have determined the action URL and name-value-pairs of
|
||||||
|
a form, you should exit the program you used to get the HTML source,
|
||||||
|
start it again and repeat the analysis with the new page.
|
||||||
|
|
||||||
|
Most parameters will be the same for both pages. But some parameters,
|
||||||
|
in particular those from hidden input fields, may change from session
|
||||||
|
to session, or even with every request. The same can be the case with
|
||||||
|
the action URL.
|
||||||
|
|
||||||
|
Parameters that remain the same can be hard-coded in your program.
|
||||||
|
If parameters change (except for user input), then your application
|
||||||
|
has to request the page with the form and extract the dynamic parameters
|
||||||
|
at runtime. If you're lucky you can locate them by simple string searches.
|
||||||
|
If you're unlucky, you need an HTML parser to make sense of the page.
|
||||||
|
HTML parsing is out of scope for HttpClient, but you'll find some
|
||||||
|
HTML parsers mentioned in the mailing list archives.
|
||||||
|
|
||||||
|
Note that a redesign of the form on the server can break your application
|
||||||
|
at any time. Whenever that happens, you have to repeat the analysis with
|
||||||
|
the new form returned by the server after the redesign, and adjust your
|
||||||
|
application accordingly.
|
||||||
|
|
||||||
|
|
||||||
|
** POST the Form
|
||||||
|
|
||||||
|
After analyzing the form, it is time to create a request that matches
|
||||||
|
what a browser would generate. If the method is GET, just add the
|
||||||
|
name-value-pairs for all parameters to the query string. If the method
|
||||||
|
is POST, things are a little more complicated.
|
||||||
|
|
||||||
|
It depends on the server how closely you have to match browser behavior.
|
||||||
|
For example, a servlet will not distinguish between parameters in the
|
||||||
|
query string and url-encoded parameters of the entity. But other server
|
||||||
|
side code might make that distinction. The safe way is always to match
|
||||||
|
browser behavior exactly.
|
||||||
|
|
||||||
|
HttpClient supports both encoding types, url-encoded and multipart-mime.
|
||||||
|
To send parameters url-encoded, use the POST request and add the parameters
|
||||||
|
directly there. To send parameters in multipart-mime, collect the parameters
|
||||||
|
in a multipart-encoded request entity and add set the entity for the
|
||||||
|
POST request. You will also find support for file upload in the multipart
|
||||||
|
package. Note that these techniques are mutually exclusive, they can not be
|
||||||
|
combined. Parameters defined in the query string of the URL can remain there.
|
||||||
|
|
||||||
|
Send the request. Inspect the response from the server:
|
||||||
|
|
||||||
|
* do you get a status code 303 or 307?
|
||||||
|
|
||||||
|
[]
|
||||||
|
|
||||||
|
That is called a redirect. Follow redirects to the ultimate page
|
||||||
|
and inspect that response. See step 6 on following redirects.
|
||||||
|
|
||||||
|
* do you get the page you expected?
|
||||||
|
|
||||||
|
[]
|
||||||
|
|
||||||
|
If the server response to your POST request indicates a problem,
|
||||||
|
try to enable or disable the expect-continue handshake, or switch
|
||||||
|
the protocol version to HTTP/1.0. If that doesn't help...
|
||||||
|
|
||||||
|
Inspect the request you are sending:
|
||||||
|
|
||||||
|
* are there significant differences to the request of a browser?
|
||||||
|
|
||||||
|
[]
|
||||||
|
|
||||||
|
There is a variety of sniffer programs you can use to grep the
|
||||||
|
browser request. Some of them are mentioned in the responses
|
||||||
|
to {{{http://mail-archives.apache.org/mod_mbox/jakarta-httpclient-user/200603.mbox/%3c981224FF5B88B349B7C1FED584D2620E02A2CBB2@CORPUSMX50B.corp.emc.com%3e this question}on the mailing list}}.
|
||||||
|
|
||||||
|
Candidates for problems are missing or wrong parameters, and differences
|
||||||
|
in the header fields. The parameters are all up to you. As a general rule
|
||||||
|
for the header fields, you should send the same as the browser does. The
|
||||||
|
order of the fields does not matter.
|
||||||
|
|
||||||
|
But there's a caveat: some header fields are controlled by HttpClient and
|
||||||
|
can not be set explicitly. Other header fields are used to indicate
|
||||||
|
capabilities which a browser has, but your application probably has not.
|
||||||
|
For these, the request from your application has to and should differ.
|
||||||
|
Here is a possibly incomplete list of headers that need special consideration:
|
||||||
|
|
||||||
|
<<<Host:>>>
|
||||||
|
|
||||||
|
controlled by HttpClient. The value is usually obtained from the URL
|
||||||
|
you are posting to. It is possible to set a different value, called
|
||||||
|
a "virtual host".
|
||||||
|
|
||||||
|
<<<Content-Type:>>>
|
||||||
|
|
||||||
|
<<<Content-Length:>>>
|
||||||
|
|
||||||
|
<<<Transfer-Encoding:>>>
|
||||||
|
|
||||||
|
controlled by HttpClient. The values are obtained from the request entity.
|
||||||
|
|
||||||
|
<<<Connection:>>>
|
||||||
|
|
||||||
|
usually controlled by HttpClient to handle connection keep-alive.
|
||||||
|
Leave it alone or set the value to "close".
|
||||||
|
|
||||||
|
<<<Content-Encoding:>>>
|
||||||
|
|
||||||
|
used to indicate the capability to process compressed responses.
|
||||||
|
Do not set this, unless you are prepared to implement decompression.
|
||||||
|
|
||||||
|
** Follow Redirects
|
||||||
|
|
||||||
|
It is quite common for servers to respond with a 303 or 307 status code
|
||||||
|
to a POST request. These redirects indicate that your application has to
|
||||||
|
send another request to retrieve the actual result of the operation you
|
||||||
|
have triggered with the POST request.
|
||||||
|
|
||||||
|
HttpClient can be configured to follow some redirects automatically.
|
||||||
|
Others it is not allowed to follow automatically, since RFC 2616 specifies
|
||||||
|
that a user interaction should take place. We will make sure that HttpClient
|
||||||
|
is compliant with this requirement, but we can't stop you from implementing
|
||||||
|
a different behavior in your application. The Location header field in the
|
||||||
|
redirect response indicates the URL from which to fetch the actual page.
|
||||||
|
It is common practice that servers return a relative URL as the location,
|
||||||
|
although the specification requires an absolute URL.
|
||||||
|
|
||||||
|
Note that there may be more than one redirect in succession. Your
|
||||||
|
application then has to follow the redirect for a redirect, but make sure
|
||||||
|
that you do not enter an infinite loop. If you find that there are more
|
||||||
|
than two redirects in succession, something probably is fishy.
|
||||||
|
|
||||||
|
|
||||||
|
** Logout
|
||||||
|
|
||||||
|
Your application can send as many GET and POST requests and follow as many
|
||||||
|
redirects as is required. But you should remember that there is a session
|
||||||
|
tracked by the server. Once your application is done, and if the web site
|
||||||
|
does provide a logout link, you should send a final request to log out.
|
||||||
|
This will tell the server that the session data can be discarded. If the
|
||||||
|
server prevents multiple logins with the same user ID and your application
|
||||||
|
has to run repeatedly, logout may even be required.
|
||||||
|
|
||||||
|
* Further Reading
|
||||||
|
|
||||||
|
ReferenceMaterials: a list of technical specifications for HTTP and related
|
||||||
|
stuff.
|
||||||
|
|
||||||
|
* {{{http://www.w3.org/TR/html4/interact/forms.html} HTML 4.01 Specification,
|
||||||
|
Section on Forms}}: Includes how browsers have to generate the data to submit
|
||||||
|
to the server.
|
||||||
|
|
||||||
|
* {{{http://www.webreference.com/html/tutorial13/} Giving Form to Forms}}:
|
||||||
|
Explains how to define HTML forms and what is submitted to the server.
|
||||||
|
Probably easier to digest than the HTML 4.01 Specification.
|
||||||
|
|
||||||
|
* {{{http://java.sun.com/developer/technicalArticles/InnerWorkings/BackstageSession/index.html}
|
||||||
|
JDC and Session Management}}: Details of a real site using session tracking,
|
||||||
|
login forms and redirects.
|
||||||
|
|
||||||
|
* {{{http://jakarta.apache.org/commons/fileupload/} Commons File Upload}}:
|
||||||
|
Server-side library for parsing multipart requests.
|
||||||
|
|
||||||
|
* {{{http://www.cs.tut.fi/~jkorpela/forms/file.html} Tutorial on File Upload
|
||||||
|
in HTML}}
|
||||||
|
|
||||||
|
[]
|
Binary file not shown.
After Width: | Height: | Size: 11 KiB |
Binary file not shown.
After Width: | Height: | Size: 11 KiB |
|
@ -51,6 +51,10 @@
|
||||||
<item name="Download" href="download.html"/>
|
<item name="Download" href="download.html"/>
|
||||||
<item name="Examples" href="examples.html"/>
|
<item name="Examples" href="examples.html"/>
|
||||||
</menu>
|
</menu>
|
||||||
|
<menu name="Documentation">
|
||||||
|
<item name="Client HTTP Programming Primer" href="primer.html"/>
|
||||||
|
<item name="NTLM support" href="ntlm.html"/>
|
||||||
|
</menu>
|
||||||
<menu name="Modules">
|
<menu name="Modules">
|
||||||
<item name="HttpClient" href="httpclient/index.html"/>
|
<item name="HttpClient" href="httpclient/index.html"/>
|
||||||
<item name="HttpMime" href="httpmime/index.html"/>
|
<item name="HttpMime" href="httpmime/index.html"/>
|
||||||
|
|
Loading…
Reference in New Issue