IMDB Details Grabber: Using PHP DOM XPath to extract Movie details

Jan 22, 2010 update: Now you can grab "cast" in the movie as well ;).
Recently, a friend of mine asked me if there is an IMDB parser that can fetch information about any movie on IMDB. Unfortunately, IMDB does not offer an API, so after a bit of googling I found a couple of nice IMDB grabbers (PHP IMDB Grabber, PHP Classes IMDB). Although both are very good, I decided to develop my own class. It relies on XPath, and it also lets developers extract additional information easily without writing much more code.

Brief Explanation of Methods:

The class 'class.imdb.php' has 4 useful methods (in addition to other ones). They are:

get($url) Fetches the details of a specific movie on IMDB. Pass the movie's URL and the method returns an associative array containing the title, image, url, Plot, Director, Release Date and Runtime.
add($name,$value='') Accepts either 1 or 2 parameters. It lets developers extract additional data beyond the default fields (i.e. title, image, url etc.) mentioned above. See the usage sections below for how to use it.
useCSV($status) By default, when get($url) is invoked, the values returned are in CSV format. If useCSV(false) is invoked first, multi-valued fields are returned as arrays instead. For example, if a movie has multiple directors, the value of 'Director:' is by default a CSV string containing the director names; if useCSV(false) is invoked and then get($url) is called, the value of 'Director:' is an array containing the list of directors. Please see Usage for more info.
showCast($t) The parameter is a boolean. If you wish to grab the cast (some people have requested it), call this method with true. Then, when you grab an IMDB page, the returned value also contains the movie's cast, with each member's thumb image, real name and cast name.

Usage (Basic):

Using the class is very easy. Include the class file, instantiate the class and then invoke get($url), where $url is the IMDB URL of the movie.

include 'path/to/class/class.imdb.php';
$imdbObj   = new Imdb();

// movieInfo contains the details of Movies in associative array format
$movieInfo = $imdbObj->get('http://www.imdb.com/title/tt0167260/');

Here's a print_r dump of $movieInfo:

Array
(
    [title:] => The Lord of the Rings: The Return of the King (2003)
    [image:] => http://ia.media-imdb.com/images/M/MV5BMjE4MjA1NTAyMV5BMl5BanBnXkFtZTcwNzM1NDQyMQ@@._V1._SX94_SY140_.jpg
    [url:] => http://www.imdb.com/title/tt0167260/
    [Plot:] => The former Fellowship of the Ring prepare for the final battle for Middle Earth, while Frodo & Sam approach Mount Doom to destroy the One Ring.|
    [Director:] => Peter Jackson
    [Release Date:] => 17 December 2003 (USA)
    [Runtime:] => 201 min | 251 min (extended edition)
)

That's it, simple enough! By default, title, image, url, Plot, Director, Release Date and Runtime are returned. All keys of the associative array are suffixed with ':' (colon); I leave it up to developers to either strip it out or just display it.
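
If you prefer keys without the trailing colon, something like the following could be used. This is only a minimal sketch: stripKeyColons() is a hypothetical helper that is not part of class.imdb.php, and the sample array is a shortened copy of the dump above.

```php
<?php
// Sample data shaped like the dump above (shortened for illustration).
$movieInfo = array(
    'title:'        => 'The Lord of the Rings: The Return of the King (2003)',
    'Director:'     => 'Peter Jackson',
    'Release Date:' => '17 December 2003 (USA)',
);

// Hypothetical helper, not part of class.imdb.php.
function stripKeyColons(array $info)
{
    $clean = array();
    foreach ($info as $key => $value) {
        // rtrim() removes only trailing colons; the values are left untouched.
        $clean[rtrim($key, ':')] = $value;
    }
    return $clean;
}

$cleanInfo = stripKeyColons($movieInfo);
print_r(array_keys($cleanInfo));
```

The values are not modified, so the colon inside the title text stays as it is.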

Usage (Customization and Advanced Usage):

By default, the class fetches the title, image, url, Plot, Director, Release Date and Runtime, which is fine for most everyday work, but a developer may want to access additional information such as Awards, Aspect Ratio, Writers, etc. for a movie on IMDB. This class allows developers to do that easily, just by calling the add method. Also, I have developed the class using the method chaining technique (thanks to jQuery for teaching me that), so you can chain methods easily.
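
For readers new to method chaining, here is a minimal sketch of the idea. FieldList is a hypothetical class written only for this illustration; it is not the Imdb class and it fetches nothing. The point is simply that each setter returns $this.

```php
<?php
// Hypothetical class used only to illustrate chaining; not the Imdb class.
class FieldList
{
    private $fields = array();
    private $csv = true;

    public function add($name, $xpath)
    {
        $this->fields[$name] = $xpath;
        return $this; // returning the object itself is what allows chaining
    }

    public function useCSV($status)
    {
        $this->csv = (bool) $status;
        return $this;
    }

    public function describe()
    {
        return array('fields' => $this->fields, 'csv' => $this->csv);
    }
}

$list   = new FieldList();
$config = $list->add('Awards:', '/div/text()')->useCSV(false)->describe();
print_r($config);
```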

If you wish to grab the cast in the movie, then call the following method:

$imdbObj->showCast(true);

Below, I will be using this URL: http://www.imdb.com/title/tt0167260/
Let's say, in addition to the default fields, you want to know the Awards won by a movie. So, open up the IMDB page ('http://www.imdb.com/title/tt0167260/') and view its source around the Awards area (in Firefox you can select that region, right-click and choose View Selection Source, or use Firebug). You will see something like this in the source code:
//..
<h5>Awards:</h5>
<div>
Won 11 Oscars. Another 106 wins & 68 nominations <a>more</a>
</div>
//..
You want to extract the raw text 'Won 11 Oscars. ...'. So, call the add method and pass 'Awards:' as the 1st parameter (the exact text inside the h5) and '/div/text()' as the 2nd parameter (as it's raw text inside the div):

$imdbObj   = new Imdb();
$imdbObj->add('Awards:','/div/text()');
$movieInfo = $imdbObj->get('http://www.imdb.com/title/tt0167260/');

In fact, the class allows you to chain methods, so you could just do this:


$imdbObj   = new Imdb();
$movieInfo = $imdbObj->add('Awards:','/div/text()')->get('http://www.imdb.com/title/tt0167260/');

Now, $movieInfo will also contain information on Awards. Here's a print_r dump of $movieInfo:

Array
(
    [title:] => The Lord of the Rings: The Return of the King (2003)
    [image:] => http://ia.media-imdb.com/images/M/MV5BMjE4MjA1NTAyMV5BMl5BanBnXkFtZTcwNzM1NDQyMQ@@._V1._SX94_SY140_.jpg
    [url:] => http://www.imdb.com/title/tt0167260/
    [Plot:] => The former Fellowship of the Ring prepare for the final battle for Middle Earth, while Frodo & Sam approach Mount Doom to destroy the One Ring.|
    [Director:] => Peter Jackson
    [Release Date:] => 17 December 2003 (USA)
    [Runtime:] => 201 min  | 251 min (extended edition)
    [Awards:] => Won 11 Oscars. 
Another 106 wins 
& 68 nominations
)

Simple, right? Now, let's say you want to know both the Awards and the Writers of that movie. Look at the source code of the IMDB page again, around the Writers area. Here's what it looks like:


<h5>Writers (WGA):</h5>
<div>
<a>J.R.R. Tolkien</a> (novel)
<a>Fran Walsh</a> (screenplay) ...
<a>more</a>
</div>

So, call the add method again. The add method also accepts an associative array as its 1st parameter, so you don't have to chain or call add multiple times on $imdbObj. For writers, the key is 'Writers (WGA):' (the raw text inside the h5, without tags) and the value is '/div/a', because all the writers' names are inside 'a' (anchor) tags. Here's how to do it:


$imdbObj = new Imdb();
$movieInfo = $imdbObj->add( array('Awards:' => '/div/text()' , 
                            'Writers (WGA):' => '/div/a'))
                     ->get('http://www.imdb.com/title/tt0167260/');

That's about it. Now, here's a print_r dump of $movieInfo:

Array
(
    [title:] => The Lord of the Rings: The Return of the King (2003)
    [image:] => http://ia.media-imdb.com/images/M/MV5BMjE4MjA1NTAyMV5BMl5BanBnXkFtZTcwNzM1NDQyMQ@@._V1._SX94_SY140_.jpg
    [url:] => http://www.imdb.com/title/tt0167260/
    [Plot:] => The former Fellowship of the Ring prepare for the final battle for Middle Earth, while Frodo & Sam approach Mount Doom to destroy the One Ring.|
    [Director:] => Peter Jackson
    [Release Date:] => 17 December 2003 (USA)
    [Runtime:] => 201 min  | 251 min (extended edition)
    [Awards:] => Won 11 Oscars.
 Another 106 wins
&
68 nominations
    [Writers (WGA):] => J.R.R. Tolkien,Fran Walsh,more
)
Note: I leave it up to the developer to strip the 'more' keyword out of the returned values.
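
One possible way to drop that 'more' entry is sketched below. dropMoreLink() is a hypothetical helper, not part of class.imdb.php. It assumes the value is either a comma-separated string, as in the dump above, or a plain array of strings, and that no name contains a comma.

```php
<?php
// Value as shown in the dump above (CSV mode).
$writersCsv = 'J.R.R. Tolkien,Fran Walsh,more';

// Hypothetical helper, not part of class.imdb.php.
// Accepts the CSV string or an array and returns the same form it was given.
function dropMoreLink($value)
{
    $isCsv = is_string($value);
    $items = $isCsv ? explode(',', $value) : $value;

    // Keep every entry except the literal link text 'more'.
    $items = array_values(array_filter($items, function ($item) {
        return trim($item) !== 'more';
    }));

    return $isCsv ? implode(',', $items) : $items;
}

echo dropMoreLink($writersCsv), "\n"; // J.R.R. Tolkien,Fran Walsh
```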

Almost all the information on an IMDB movie page is stored in this format:


<h5>Name_of_Data:</h5>
<div>
_RAW_TEXT_ <TAG_NAME>TEXT_1</TAG_NAME><TAG_NAME>TEXT_2</TAG_NAME>
</div>
And you will always want to extract either the '_RAW_TEXT_' or the set of 'TEXT_1', 'TEXT_2' contained in the tag 'TAG_NAME'. So, when calling the add method, if you want to grab '_RAW_TEXT_', just pass '/div/text()'. If you want to grab 'TEXT_1', 'TEXT_2' etc., which are inside other tags, just pass '/div/TAG_NAME' as the 2nd parameter. For the 1st parameter, always pass the exact text, including spaces, contained in the h5 ('Name_of_Data:').
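
To make the pattern concrete, here is a rough sketch of how a label plus an XPath suffix can be turned into a query with PHP's DOMDocument and DOMXPath. It is not the actual code of class.imdb.php: extractField() is a hypothetical helper, the HTML is a simplified stand-in for the real page, and the way the query is assembled is an assumption.

```php
<?php
// Simplified stand-in for a fragment of a movie page (real markup has more attributes).
$html = '<html><body>'
      . '<div><h5>Awards:</h5><div>Won 11 Oscars. <a href="#">more</a></div></div>'
      . '<div><h5>Writers (WGA):</h5><div><a href="#">J.R.R. Tolkien</a> (novel) '
      . '<a href="#">Fran Walsh</a> (screenplay)</div></div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Hypothetical helper; the class may build its queries differently.
function extractField(DOMXPath $xpath, $label, $suffix)
{
    // Find the h5 whose text equals the label, step up to its parent,
    // then apply the caller's suffix (e.g. '/div/text()' or '/div/a').
    $query  = '//h5[normalize-space(text())="' . $label . '"]/..' . $suffix;
    $values = array();
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $values[] = $text;
        }
    }
    return $values;
}

print_r(extractField($xpath, 'Awards:', '/div/text()'));
print_r(extractField($xpath, 'Writers (WGA):', '/div/a'));
```

With '/div/text()' only the text directly inside the div is returned, which is why the link text does not show up in Awards, while '/div/a' returns the text of each anchor.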

CSV: By default, all values returned are in CSV format. If you would rather get a value as an array, for example the list of Writers as an array instead of CSV, do the following:


$imdbObj   = new Imdb();
$movieInfo = $imdbObj->useCSV(false)->get('http://www.imdb.com/title/tt0167260/');
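
If your own code has to cope with both modes, a small normalizing function can help. The sketch below is only an illustration: toList() is a hypothetical helper, and the two sample arrays are assumptions about what each mode returns for a multi-valued field.

```php
<?php
// Hypothetical helper, not part of class.imdb.php.
function toList($value)
{
    if (is_array($value)) {
        return $value; // useCSV(false): already a list
    }
    // Default CSV mode: split on commas and trim stray whitespace.
    return array_map('trim', explode(',', $value));
}

// Assumed shapes of the same field in each mode.
$csvMode   = array('Writers (WGA):' => 'J.R.R. Tolkien,Fran Walsh');
$arrayMode = array('Writers (WGA):' => array('J.R.R. Tolkien', 'Fran Walsh'));

foreach (toList($csvMode['Writers (WGA):']) as $writer) {
    echo $writer, "\n";
}
```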

Attachment            Size
test.imdb.php.txt     642 bytes
class.imdb.php.txt    5.87 KB

Nice script...
It's simple and easy to use. Thanks for sharing.

Is there any way the cast can be grabbed? I am looking at this script, but I need a way to grab the cast. Is it possible?

Yes, I have added the ability to grab cast now :)

Requesting some help

Thanks for making this. I have been looking for something like this to manage my personal movie collection using PHP / MySQL.

When I put your 2 files in my xampp/htdocs/moviecritc folder and run it through http://localhost/moviecritc/test.imdb.php in the browser, I get the following error:

http://i.imgur.com/YGRxG.png

Can you please help identify this? Thanks.

nice writing

This piece of writing is excellent and I liked it a lot.

Awesome!

This is really awesome. Really easy to use.

Thanks

cant get the cast

Hi there

I have used your script. It works great, but whatever I do I can't retrieve the cast of the movie from that page. Please let me know if it's possible.
cheers

I think yours will be even better than these alternatives. Looks pretty straight forward.

Doesn't seem to work !

The code doesn't seem to work. I get error messages; it cannot find DomDocument. Where is that declared?

Can you elaborate more on the error?

Can you please explain what error you are getting? DOMDocument is a standard class in PHP 5. Ensure that you are using at least PHP 5.1.

Error with PHP

For some reason, I'm getting the following error:
Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: htmlParseEntityRef: no name in http://www.imdb.com/title/tt0167260/, line: 642 in class.imdb.php on line 146

Over and over and over again, but the result does show at the end.

Thanks, I will look into it

Thanks for trying it out. From the message in your comment, it seems to be a warning, not an error. I will look into it. For the time being, if you don't want the warning to be displayed, you can use PHP's error_reporting function to hide warnings.

http://us2.php.net/manual/en/function.error-reporting.php
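
If you would rather not change error_reporting for the whole script, another option is libxml_use_internal_errors(), which makes libxml collect parse warnings instead of printing them. The following is a minimal sketch on an inline HTML string, not a patch for class.imdb.php; the same calls could be placed around the loadHTMLFile() call.

```php
<?php
// Collect libxml parse warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// A bare '&' is the kind of markup that can trigger "htmlParseEntityRef: no name".
$doc->loadHTML('<html><body><p>Frodo & Sam</p></body></html>');

$warnings = libxml_get_errors();       // array of LibXMLError objects
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the earlier setting

echo count($warnings), " parse warning(s) collected\n";
echo trim($doc->getElementsByTagName('p')->item(0)->nodeValue), "\n";
```

The document is still parsed; the warnings are simply kept in a list that you can inspect or ignore.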

good one!!!

thanks :-)
i'll try to use it soon...
