How to save the HTML source from a website

posted: September 18, 2009

Whether you need to check a website's source code programmatically for data scraping or for some other purpose, it is an easy process - all you need is the following piece of Java code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;

import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class HtmlReader
{

  private String address;
  
  public HtmlReader(String address)
  {
    this.address = address;
  }
  
  public String readHtml()
  {
    StringBuilder html = new StringBuilder();
    try
    {
      URL url = new URL(address);
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      // try-with-resources closes the reader even if reading fails
      try (BufferedReader br = new BufferedReader(
        new InputStreamReader(
          conn.getInputStream()
        )
      ))
      {
        String line;
        while ((line = br.readLine()) != null)
        {
          // readLine() strips the line terminator, so put one back
          html.append(line).append('\n');
        }
      }
    }
    catch (MalformedURLException mue)
    {
      System.err.println(mue.getMessage());
    }
    catch (IOException ioe)
    {
      System.err.println(ioe.getMessage());
    }
    return html.toString();
  }
  
  public static void main(String[] args)
  {
    if (args.length == 1)
    {
      HtmlReader hr = new HtmlReader(args[0]);
      String html = hr.readHtml();
      System.out.println(html);
    }
    else
    {
      System.out.println("usage: java HtmlReader url");
    }
  }

}

If you run this piece of code with a valid URL as its only argument, the HTML contents of that page will be printed to the console.
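
The program prints the HTML rather than writing it anywhere. If you want the source in a file, something along these lines should work. This is only a minimal sketch: the class name HtmlSaver and the method save are made up for illustration, and it assumes UTF-8 is an acceptable encoding for the output file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the HtmlReader class above.
public class HtmlSaver
{

  // Writes the given HTML text to the target file as UTF-8 (an assumption),
  // creating the file or overwriting an existing one.
  public static void save(String html, Path target) throws IOException
  {
    Files.write(target, html.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws IOException
  {
    // A hard-coded sample keeps this sketch self-contained; in practice
    // the string would come from HtmlReader.readHtml().
    String html = "<html><body>Hello</body></html>";
    save(html, Paths.get("page.html"));
  }

}
```

Combined with the reader, a call such as HtmlSaver.save(hr.readHtml(), Paths.get("page.html")) would store the downloaded source in page.html in the current directory.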

Please note that this does NOT save the entire webpage to your disk - it only fetches the HTML file you requested. To get an exact copy of a webpage on your disk, you also need to download the images, CSS files, Flash files and any other resources the page refers to.
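
As a rough first step toward a full copy, you could list the resources a page refers to. The sketch below is an illustration under loose assumptions: ResourceFinder and findResources are invented names, and the regular expression simply collects every quoted src and href value, so it also picks up ordinary hyperlinks and will miss unquoted attributes. A real HTML parser would be the more robust choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
public class ResourceFinder
{

  // Matches src="..." or href="..." with single or double quotes.
  private static final Pattern LINK = Pattern.compile(
    "(?:src|href)\\s*=\\s*[\"']([^\"']+)[\"']",
    Pattern.CASE_INSENSITIVE
  );

  // Returns the attribute values in the order they appear in the HTML.
  public static List<String> findResources(String html)
  {
    List<String> found = new ArrayList<String>();
    Matcher m = LINK.matcher(html);
    while (m.find())
    {
      found.add(m.group(1));
    }
    return found;
  }

  public static void main(String[] args)
  {
    String sample = "<link href=\"style.css\"><img src='logo.png'>";
    System.out.println(findResources(sample)); // prints [style.css, logo.png]
  }

}
```

Each value found this way would still have to be resolved against the page's address and downloaded separately.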
