Welcome to Marlon's place!


How to save the HTML source from a website

posted: September 18, 2009

Whether you need the HTML source of a web page for data scraping or for some other purpose, fetching it programmatically is very easy: the following piece of code is all you need.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;

import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class HtmlReader
{

  private String address;
  
  public HtmlReader(String address)
  {
    this.address = address;
  }
  
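  public String readHtml()
  {
    // StringBuilder is enough here: the buffer is never shared between threads
    StringBuilder html = new StringBuilder();
    try
    {
      URL url = new URL(address);
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      // try-with-resources closes the reader even if reading fails halfway.
      // UTF-8 is an assumption; the page may declare a different charset.
      try (BufferedReader br = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8")))
      {
        String line;
        while ((line = br.readLine()) != null)
        {
          // readLine() strips the line terminator, so put one back
          html.append(line).append('\n');
        }
      }
    }
    catch (MalformedURLException mue)
    {
      // report errors on stderr so they do not mix with the HTML on stdout
      System.err.println(mue.getMessage());
    }
    catch (IOException ioe)
    {
      System.err.println(ioe.getMessage());
    }
    return html.toString();
  }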
  
  public static void main(String[] args)
  {
    if (args.length == 1)
    {
      HtmlReader hr = new HtmlReader(args[0]);
      String html = hr.readHtml();
      System.out.println(html);
    }
    else
    {
      System.out.println("usage: java HtmlReader url");
    }
  }

} 

If you run this piece of code with a valid URL as its only argument, the HTML contents of that page will be printed to the console.
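
Printing to the console is not quite the same as saving. If you want the source in a file, one option is to copy the bytes straight from the URL to disk. The following is only a minimal sketch: the class name HtmlSaver and its save method are made up for this example, and it copies the raw bytes without looking at the character encoding.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class, not part of the HtmlReader listing
public class HtmlSaver
{

  // Copies the raw bytes served at the address into the target file,
  // overwriting the file if it already exists. Returns the byte count.
  public static long save(String address, Path target) throws IOException
  {
    try (InputStream in = new URL(address).openStream())
    {
      return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }
  }

  public static void main(String[] args) throws IOException
  {
    if (args.length == 2)
    {
      long bytes = save(args[0], Paths.get(args[1]));
      System.out.println("saved " + bytes + " bytes to " + args[1]);
    }
    else
    {
      System.out.println("usage: java HtmlSaver url file");
    }
  }

}
```

For example, running "java HtmlSaver http://example.com/ page.html" (example.com is just a placeholder address) should leave the page source in page.html. Because the bytes are copied as they arrive, the file keeps whatever encoding the server used.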

Please note that this does NOT save the entire web page to your disk, only the HTML file you requested. To get an exact copy of the page on your disk, you also need to download the images, CSS files, Flash files and any other resources the page refers to.
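
To get an idea of what those extra downloads would be, you can scan the HTML for the files it refers to. The sketch below is a rough illustration and not a real HTML parser: the class ResourceFinder and its method are invented for this example, it only looks at quoted src attributes and at href attributes on link tags, and it assumes every reference can be turned into a URL that Java understands (something like a data: reference would make it throw).

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the HtmlReader listing
public class ResourceFinder
{

  // src="..." on any tag (images, scripts, embedded Flash)
  private static final Pattern SRC = Pattern.compile(
    "\\bsrc\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

  // href="..." on <link> tags only (stylesheets, icons); <a href> is skipped
  private static final Pattern LINK_HREF = Pattern.compile(
    "<link\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

  public static List<String> findResources(String html, String pageAddress)
    throws MalformedURLException
  {
    URL base = new URL(pageAddress);
    List<String> found = new ArrayList<>();
    for (Pattern p : new Pattern[] { SRC, LINK_HREF })
    {
      Matcher m = p.matcher(html);
      while (m.find())
      {
        // resolve relative references against the page's own address
        found.add(new URL(base, m.group(2)).toString());
      }
    }
    return found;
  }

  public static void main(String[] args) throws MalformedURLException
  {
    String html = "<link rel=\"stylesheet\" href=\"style.css\">"
      + "<img src='logo.png'><a href=\"other.html\">next</a>";
    // example.com is a placeholder address
    for (String r : findResources(html, "http://example.com/blog/post.html"))
    {
      System.out.println(r);
    }
  }

}
```

Each address it returns could then be fetched the same way as the page itself. Regular expressions will miss or misread some pages, so for anything serious a proper HTML parser is the safer choice.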
