IMDB Details Grabber: Using PHP DOM XPath to extract Movie details
Recently, a friend of mine asked me if there's an IMDB parser that can fetch information about any movie on IMDB. Unfortunately, IMDB does not offer an API, so after a bit of googling I found a nice IMDB grabber (PHP IMDB Grabber). Although it is a good one, I decided to develop my own class relying on XPath and DOM traversal.
Brief Explanation of Methods:
The class 'class.imdb.php' has 4 useful methods (in addition to others). They are:
| Method | Description |
|---|---|
| get($url) | Fetches the details of a specific movie on IMDB. Pass the URL and the method returns an associative array containing the title, image, url, Plot, Director, Release Date and Runtime. |
| useCSV($status) | By default, when get($url) is invoked, multi-valued fields are returned in CSV format. If useCSV(false) is called first, those fields are returned as arrays instead. For example, if a movie has multiple directors, the value of 'Director:' is by default a CSV string of director names; after useCSV(false), it is an array containing the list of directors. Please see Usage for more info. |
| showCast($t) | The parameter is a boolean. To grab the cast (some people have requested it), call this method with true. The value returned for an IMDB page then also contains the movie's cast, with each member's thumbnail image, real name and character name. |
| showRating($t) | The parameter is a boolean. To skip grabbing the rating, call this method with false. |
Usage (Basic):
Using the class is very easy. Include the class file, instantiate the class and then invoke get($url), where $url is the IMDB URL for the movie.
include 'path/to/class/class.imdb.php';
$imdbObj = new Imdb();
// movieInfo contains the details of Movies in associative array format
$movieInfo = $imdbObj->get('http://www.imdb.com/title/tt0167260/');
// OR simply enter the movie name :)
$movieInfo = $imdbObj->get('The Matrix');
Here's a vardump of movieInfo:
(
    [title:] => The Lord of the Rings: The Return of the King (2003)
    [image:] => http://ia.media-imdb.com/images/M/MV5BMjE4MjA1NTAyMV5BMl5BanBnXkFtZTcwNzM1NDQyMQ@@._V1._SX94_SY140_.jpg
    [url:] => http://www.imdb.com/title/tt0167260/
    [Plot:] => The former Fellowship of the Ring prepare for the final battle for Middle Earth, while Frodo & Sam approach Mount Doom to destroy the One Ring.|
    [Director:] => Peter Jackson
)
That's it, simple enough! By default, the title, image, url, Plot, Director, Release Date and Runtime are returned. All keys of the associative array are suffixed with ':' (colon); I leave it up to developers to either strip it out or just display it.
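If you would rather work with keys that have no trailing colon, a small helper can rename them. This is only a sketch: stripKeyColons() is a hypothetical function of mine, not part of class.imdb.php, and the sample array just imitates the dump above.

```php
<?php
// Hypothetical helper (not part of class.imdb.php): returns a copy of the
// array with any trailing ':' removed from each key.
function stripKeyColons(array $movieInfo): array
{
    $clean = [];
    foreach ($movieInfo as $key => $value) {
        $clean[rtrim((string) $key, ':')] = $value;
    }
    return $clean;
}

// Sample data shaped like the dump above.
$movieInfo = [
    'title:'    => 'The Lord of the Rings: The Return of the King (2003)',
    'Director:' => 'Peter Jackson',
];

$clean = stripKeyColons($movieInfo);
echo $clean['Director'], "\n"; // prints "Peter Jackson"
```

Only the keys are touched; colons inside the values (as in the title) are left alone.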
Usage (Customization and Advanced Usage):
By default, the class fetches the title, image, url, Plot, Director, Release Date, Runtime, Aspect Ratio and Writers from IMDB.
If you wish to grab the cast in the movie, then call the following method:
$imdbObj->showCast(true);
CSV: By default, all values are returned in CSV format. If you want a value as an array instead (for example, you may want the list of Writers as an array rather than a CSV string), do the following:
$imdbObj = new Imdb();
$movieInfo = $imdbObj->showCast(true)
->useCSV(false)
->get('http://www.imdb.com/title/tt0167260/');
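Code that consumes the result may not know whether useCSV(false) was called. A minimal sketch, assuming a field is either a comma-separated string or an array: toList() below is a hypothetical helper (not in the class) that turns both forms into a plain list.

```php
<?php
// Hypothetical helper: accepts either form returned by get() and always
// yields an array of trimmed, non-empty strings.
function toList($value): array
{
    $items = is_array($value) ? $value : explode(',', (string) $value);
    $items = array_map('trim', $items);
    // Drop empty entries (e.g. from an empty string or a trailing comma).
    return array_values(array_filter($items, function ($item) {
        return $item !== '';
    }));
}

print_r(toList('Andy Wachowski,Lana Wachowski')); // CSV form
print_r(toList(['Peter Jackson']));                // array form
```

Note that splitting on commas will break a value that itself contains a comma, so the array form is the safer one to rely on.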
View Demo Download from Github
| Attachment | Size |
|---|---|
| test.imdb.php.txt | 642 bytes |
| class.imdb.php.txt | 5.87 KB |
| demo.php_.txt | 2.13 KB |
An alternative I'm using: https://github.com/FabianBeiner/PHP-IMDB-Grabber
"thumb doesn't find correct image; Runtime doesn't work"
Hi, thanks for this great script...
It seems there are a few problems at the moment...
1) Here's the 'thumb' I get in the "cast array":
http://i.media-imdb.com/images/SF984f0c61cc142e750d1af8e5fb4fc0c7/nopict...
Always the same image I get.
2) Runtime seems to not work for me.
3) And next to [release date] I get:
"[Release Date:] => (Italy) »"
And even if it's possible to print this on the same line :D
"[cast] =>
Charlie Croker
"
Is There something you can do?
Thanks anyway...
Hi Alif,
Nice article, but I need your help: how do I integrate this PHP class with WordPress, so that the details from IMDB go into the WordPress database and get posted?
I only use the WordPress platform and don't have the coding background to build an IMDB grabber into my blog.
Thank you for your help
Hi, the image is not showing; it shows only the image location.
Image will not show directly, because I believe IMDB blocks image linking from external source. You will have to cache the image locally or use other technique to display them.
Hope that helps.
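A rough sketch of the local caching idea, under my own assumptions: cacheImage() is a hypothetical helper, the cache directory is whatever you pass in, and the download step is injectable so it can be swapped or stubbed. Please check IMDB's terms of use before storing their images.

```php
<?php
// Hypothetical helper: returns a local file path for $imageUrl, downloading
// the image only once into $cacheDir. $fetch is any callable that takes a URL
// and returns the raw bytes (or false on failure); it defaults to
// file_get_contents. Returns false when the image cannot be stored.
function cacheImage(string $imageUrl, string $cacheDir, ?callable $fetch = null)
{
    $fetch = $fetch ?? 'file_get_contents';

    if (!is_dir($cacheDir) && !mkdir($cacheDir, 0755, true)) {
        return false;
    }

    // Name the file after a hash of the URL so it is safe on any filesystem.
    $ext  = pathinfo((string) parse_url($imageUrl, PHP_URL_PATH), PATHINFO_EXTENSION) ?: 'jpg';
    $path = rtrim($cacheDir, '/') . '/' . md5($imageUrl) . '.' . $ext;

    if (!is_file($path)) {
        $bytes = $fetch($imageUrl);
        if ($bytes === false || file_put_contents($path, $bytes) === false) {
            return false;
        }
    }
    return $path;
}

// Usage (not executed here, since it needs network access):
// $local = cacheImage($movieInfo['image:'], __DIR__ . '/cache');
```

Your page would then point its img tag at the cached file instead of the IMDB URL.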
Hi,
When i use this imdb link http://www.imdb.com/title/tt0076306/ in your demo script then it shows me the english version page of IMDB.
But when i use your script in my localhost then it shows me original european version page not the english one.
Could you please tell me what configuration needed in my server to fetch the english version of imdb.
I'm running into the very same issue these days.
My server was updated and now
all imdb Infos are in unspeakable languages :(
I use imdb.class to verify movie titles against the IMDB database to prevent misspellings in my own database, so this is very odd, since my new server is in the very same rack my old server was in.
But honestly I have no clue what could have caused this behaviour.
I'm sure There is a workaround you know of ;)
So please let us know what to do.
Maybe a server setting I missed, or the like...
Any help will be appreciated
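One thing that may be worth trying, and it is an assumption rather than something confirmed in this thread: IMDB might choose the page language from the request's Accept-Language header (or from the server's location, in which case this will not help). The sketch below asks for English by setting that header on the stream context that libxml uses when loadHTMLFile() fetches a URL.

```php
<?php
// Assumption: the localized pages are selected from the Accept-Language
// request header. This builds an HTTP stream context that asks for English.
$context = stream_context_create([
    'http' => [
        'header' => "Accept-Language: en-US,en;q=0.8\r\n",
    ],
]);

// Tell libxml (and therefore DOMDocument::loadHTMLFile) to use this context
// for the next document it loads.
libxml_set_streams_context($context);

// From here on, call the class as usual, for example:
// $movieInfo = $imdbObj->get('http://www.imdb.com/title/tt0076306/');
```

Since the class may load more than one page per call (a search page, then the movie page), it may be safest to set the context again before each get() if the first attempt only partly works.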
This might sound like a stupid question, but how do I get access to IMDB if I'm behind a proxy that blocks 'entertainment' content ?
I take it you are referring to the image contents being blocked. You could cache the images locally perhaps. But, it might be a violation of their policy. Please check IMDB's terms of use.
Hi,
Since yesterday, I can't retrieve the user rating anymore:
/opt/bin/imdb.php "the matrix"
Array
(
[title:] => Matrix
[year:] => 1999
[url:] => http://www.imdb.com/title/tt0133093/
[image] => http://ia.media-imdb.com/images/M/MV5BMjEzNjg1NTg2NV5BMl5BanBnXkFtZTYwNj...
[Directors:] => Andy Wachowski,Lana Wachowski
[Writers:] => Andy Wachowski,Lana Wachowski
[Storyline:] => Thomas A. Anderson is a man living two lives. By day he is an average computer programmer and by night a malevolent hacker known as Neo. Neo has always questioned his reality but the truth is far beyond his imagination. Neo finds himself targeted by the police when he is contacted by Morpheus, a legendary computer hacker branded a terrorist by the government. Morpheus awakens Neo to the real world, a ravaged wasteland where most of humanity have been captured by a race of machines which live off of their body heat and imprison their minds within an artificial reality known as the Matrix. As a rebel against the machines, Neo must return to the Matrix and confront the agents, super powerful computer programs devoted to snuffing out Neo and the entire human rebellion.
[User Rating:] =>
[Total Votes:] => 441,214 votes
[Genres:] => Action,Adventure,Sci-Fi
[Country:] => USA,Australia
[Language:] => English
[Runtime:] => 136 min
[Aspect Ratio:] => 2.35 : 1
[Release Date:] => 23 June 1999
(Spain) »
[Budget:] => $63,000,000
(estimated)
)
$grabValue['User Rating:'] = $xpath->query("//span[@class='rating-rating']")->item(0)->nodeValue;
Cheers, Chris.
Hi,
I also notice it now. It seems IMDB may have changed their layout slightly. I will look into it and update it.
Thanks for pointing it out :).
Regards,
$grabValue['User Rating:'] = $xpath->query("//div[@class='rating rating-big']/span[@class='rating-rating']/span[@class='value']/text()")->item(0)->nodeValue;
$grabValue['Total Votes:'] = $xpath->query("//div[@class='star-box']/a[@href='ratings']/span[@itemprop='ratingCount']/text()")->item(0)->nodeValue;
Nice class btw - I'm using it with my own caching system - and it works perfectly!
hi
the demo.php_.txt is not working???
I tried everything. Please help me.
Not sure why it isn't working. What problem are you getting? (:S.. I take it you renamed the .txt file to .php?)
Hi there,
How can I get the data out of the array to post it into a database with a INSERT INTO query?
Thanks
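A minimal sketch of one way to do this with PDO and a prepared statement. Everything here is illustrative: the `movies` table, its column names and both helper functions are hypothetical, and the PDO connection is assumed to be configured to throw exceptions on errors.

```php
<?php
// Hypothetical mapping from the class's array keys to the columns of a
// hypothetical `movies` table; adjust both sides to your own schema.
function movieRow(array $movieInfo): array
{
    $map = [
        'title:'   => 'title',
        'url:'     => 'url',
        'Runtime:' => 'runtime',
        'Genres:'  => 'genres',
    ];
    $row = [];
    foreach ($map as $key => $column) {
        $value = $movieInfo[$key] ?? null;
        // Arrays (from useCSV(false)) are flattened to a comma-separated string.
        $row[$column] = is_array($value) ? implode(',', $value) : $value;
    }
    return $row;
}

// Inserts one row using named placeholders, so the scraped values are never
// concatenated into the SQL string. Column names come only from movieRow().
function insertMovie(PDO $pdo, array $row): bool
{
    $columns      = array_keys($row);
    $placeholders = array_map(function ($c) { return ':' . $c; }, $columns);
    $sql = sprintf(
        'INSERT INTO movies (%s) VALUES (%s)',
        implode(', ', $columns),
        implode(', ', $placeholders)
    );
    return $pdo->prepare($sql)->execute($row);
}

// Usage (needs a real connection, so it is not executed here):
// $pdo = new PDO($dsn, $user, $pass, [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// insertMovie($pdo, movieRow($movieInfo));
```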
$grabValue['Genres:'] = $this->getValue("a", $xpath->query("//div[@class='infobar']/a"));
Anyway, I saw that IMDB is not good at searching with the year, so I modified getImdbURL to return the result that matches the right year:
public function getImdbURL($url) {
    $queryStr = '//p[@style]/b/a';
    $validTitleStr = "//head/link[@rel='canonical']";
    $searchURL = 'http://www.imdb.com/find?q='.urlencode($url);
    $searchDom = new DomDocument();
    $searchLoad = $searchDom->loadHTMLFile($searchURL);
    $xpath = new DomXPath($searchDom);
    $query = '//table/tr/td[@valign="top"]/a/..'; #ok
    $items = $xpath->query($query);
    $totalItem = $items->length;
    $linkHrefURL = "";
    // the loop reads item $i+2, so stop before that index runs past the list
    for($i = 0; $i + 2 < $totalItem && $linkHrefURL == ""; $i+=3) {
        $titleWithAkas = $items->item($i+2)->nodeValue;
        // year shown next to the result title, e.g. "Title (1999)"
        $hasYear = preg_match("/([\w ]+)\s*\(([0-9]{4})\).*/", $titleWithAkas, $matches);
        // year typed by the user in the search string
        preg_match("/([0-9]{4})/", $url, $matches_year_in_url);
        if ($hasYear && $matches_year_in_url) {
            if ($matches_year_in_url[1] == $matches[2]) {
                $allLinks = $xpath->query('a', $items->item($i+2));
                $linkHrefURL = $allLinks->item(0)->getAttribute('href');
                $linkHrefURL = 'http://www.imdb.com'.$linkHrefURL;
            }
        }
    }
    if ($linkHrefURL == "") { # fallback
        $linkHrefURL = $xpath->query($validTitleStr)->item(0)->getAttribute('href');
    }
    if($this->isValidURL($linkHrefURL) ) {
        return $linkHrefURL;
    }
    if($xpath->query($queryStr)->length > 0 ) {
        return 'http://www.imdb.com'.$xpath->query($queryStr)->item(0)->getAttribute('href');
    } else {
        return false;
    }
}
}
thanks for your script.
Thanks, these scripts are really awesome. I've been maintaining a database of my movies for years, and now I'm able to update it with the retrieved IMDB details.
Is there any way to also get the producer/composer details?
Hello,
Thanks for using my script. Many movies at IMDB do not have a composer/producer listed (unless I have missed something), so if you want, you can add an XPath expression for the producer/composer, but it won't work for all movies.
I followed the example, but I get nothing when I try to dump the array.
include 'class.imdb.php';
error_reporting(E_ERROR);
$imdbObj = new Imdb();
$imdbObj->showRating(true);
$movieInfo = $imdbObj->get('Undisputed');
print_r($movieInfo);
Here the link:
http://skullteria.byethost32.com/Nebenbei/imdbGrabber/test.imdb.php
I am not sure why it's not working. Can you provide the source to your code, so I can have a look at it?
Thanks,
Could this script works also with the italian version of imdb (imdb.it)? Thanks for this nice script...
I haven't tried it, but looking at the layout of imdb.it, it may or may not work. I am not sure!
How do you get the nice display of the contents like you have on your demo page? what code should I use to achieve the same results?
I have added a demo.php (demo.php_.txt) file with the article. Please download it. It has a similar layout as my demo page.
Thanks,
The cast members will only come up if they are hyperlinked at IMDB. How can we get all the cast? And how can we add budget info? Great stuff!!
Hey
I wanted to ask you something: could you write some code so that when the grabber grabs images, it can resize them to the size we want?
Is it possible to get the exact match;
I tried searching for the movie "3 idiots" and it always pulls the popular match not the exact match..
even when i search for "3 idiots (2009)" it gives the popular match..
Great script.. keep it up...
I am having the same problem. I am working on a database driven php site and everything builds and populates off the info that is scraped by this script.
But when I try to pull in movies like Shawshank Redemption or Hereafter among others I end up with the most popular (foreign) title. I am going to look through the script but hoping someone has already solved the issue.
I tried this script and have some issues with the movies where there are part 2 or 3
example when i type Paranormal Activity 2 it still points at the first one
even when i try Paranormal Activity (2010) it gives the same result..
Is there a fix for this?
Thank you..
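A possible starting point for the exact-match problem, as a sketch only: split the year out of the search string, then prefer the search result that names the same year. Both helpers are hypothetical, and the candidate list in the usage example uses made-up URLs and titles rather than real IMDB results.

```php
<?php
// Hypothetical helper: splits "Paranormal Activity (2010)" into a title and
// a year. The year is null when the query has no "(YYYY)" suffix.
function splitTitleYear(string $query): array
{
    if (preg_match('/^(.*?)\s*\((\d{4})\)\s*$/', $query, $m)) {
        return ['title' => trim($m[1]), 'year' => (int) $m[2]];
    }
    return ['title' => trim($query), 'year' => null];
}

// Hypothetical helper: $candidates maps a result URL to the text shown next
// to it on the search page. Returns the first URL whose text contains
// "(YEAR)", or null when no year was given or nothing matches.
function pickByYear(array $candidates, ?int $year)
{
    if ($year === null) {
        return null;
    }
    foreach ($candidates as $url => $text) {
        if (strpos($text, '(' . $year . ')') !== false) {
            return $url;
        }
    }
    return null;
}

// Made-up candidates, for illustration only.
$candidates = [
    '/title/tt0000001/' => 'Example Movie (2007)',
    '/title/tt0000002/' => 'Example Movie 2 (2010)',
];
$parsed = splitTitleYear('Example Movie (2010)');
echo pickByYear($candidates, $parsed['year']), "\n"; // prints "/title/tt0000002/"
```

When pickByYear() returns null, the caller can fall back to whatever the class picks today (the popular match).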
Love this script...
Fatal error: Call to undefined method Imdb::add()
$movieInfo = $imdbObj->add('Writers:', '/div/a')
->add('Genre:', '/div/a')
don't work for me...
Hello,
I have removed the add method from the class. The new IMDB layout does not follow a consistent pattern, unlike the previous layout. Currently, the call is commented out in the test class.
I will update my test class and remove that piece of code altogether to avoid confusion.
Thanks,
I love this script. Is there anyway to get the list of genres for the movie?
Hey there,
my suggestion is to add the following lines after grabbing the storyline
// grab genres:
$grabValue['Genres:'] = "";
$genresPath = $xpath->query("//div[@class='see-more inline canwrap'][h4='Genres:']/a/text()");
for($i=0;$i < $genresPath->length; $i++) {
$grabValue['Genres:'] .= $genresPath->item($i)->nodeValue .", ";
}
$grabValue['Genres:'] = rtrim($grabValue['Genres:']," ,");
Again, great script!
Alex
Thanks! That works perfectly.
Thanks.. There's still some polishing that needs to be done on the class. I have been too busy and have not been able to sit with it.
Thanks again for the genre addition =).
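For anyone who wants to try the genre XPath without hitting IMDB, here is a self-contained sketch. The HTML fragment is made up to imitate the structure that the expression above expects; the real page may differ.

```php
<?php
// Made-up markup that imitates the structure the XPath expression expects.
$html = '<html><body>'
      . '<div class="see-more inline canwrap"><h4>Genres:</h4>'
      . '<a href="#">Action</a> | <a href="#">Adventure</a> | <a href="#">Sci-Fi</a>'
      . '</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select the text of every link inside the div whose <h4> reads "Genres:".
$nodes = $xpath->query("//div[@class='see-more inline canwrap'][h4='Genres:']/a/text()");

$genres = [];
foreach ($nodes as $node) {
    $genres[] = trim($node->nodeValue);
}

// implode() joins the names, so no trailing separator needs trimming.
$genreList = implode(', ', $genres);
echo $genreList, "\n"; // prints "Action, Adventure, Sci-Fi"
```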
HEy,
can you publish your php from your demo page, its very good formated and without the array,.........
Big Thx
Gret
http://www.imdb.com/title/tt0860462/
This IMDB link doesn't work. How can I fix it?
thank you
thanks, i hope this works
Great script! Thank you.
I would just add few lines to capture the "Release date": Starting from line 160 I repeated the code in the loop above with a small modification:
$nodeNameList=$xpath->query('h4',$nodeList->item($i));
$grabName=trim($nodeNameList->item(0)->nodeValue);
$nodeValueList=$xpath->query('text()', $nodeList->item($i));
$grabValue[$grabName]=$this->getValue('/text()',$nodeValueList);
Alex
Hey, thanks for trying it out. I have added the Release Date now; all you need to do is add an extra line in the constructor (around line 86) to get the Release Date.
$this->defaultList = array (
.....
'Release Date:' => '/text()'
);
This should solve it.
Thanks,
This script is not working with the new IMDB layout. Do you have an updated script?
Thank you for pointing it out. I just noticed IMDB changed their layout completely. I will look into it asap.
Thanks again.
I updated my script to grab from the new Layout. It seems to be working fine with no issues.
Thanks,
Hello... I am trying to use this excellent script but can't get it to fetch the following data:
studio, writer, producer, cast, audience rating and rating.
For example, I use the following code:
$movieInfo = $imdbObj->add('Writer:', '/div/text()');
echo $movieInfo['Writer:']."";
but result is empty.
here is imdb source:
Writer:
Barry Levy
but i think that problem is /name/nm1633356/
So could you fix this?
many thanks.
Your script is really nice but my script need to show search result from imdb and then grab infos... how can i do that? Thank you very much. :)
I found out that the script works, it get the information, but for some reason I get all of those DOMDocument errors.
Any ideas on how to get them to go away?
Best regards,
Dan
PS: Awesome script :)
Hi Dan,
Thanks for trying out the script. These are actually warnings, not errors, in PHP. They happen because the IMDB HTML pages are not well formed and/or are not valid HTML.
At the top of your page where you are calling the class.imdb.php, you can use the following line:
// only report errors in PHP and ignore warnings.
error_reporting(E_ERROR);
I have also used this line on the test.imdb.php file provided on the blog.
I hope this helps,
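An alternative to lowering error_reporting for the whole page is to let libxml collect its parser messages internally. This is a sketch of that technique on a deliberately malformed, made-up fragment; it is not what class.imdb.php does.

```php
<?php
// Deliberately sloppy, made-up markup: an unknown tag and a bare "&".
$html = '<html><body><layer><p>Tom & Jerry</p></layer></body></html>';

// Collect libxml parser messages internally instead of emitting PHP warnings.
// The call returns the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$problems = libxml_get_errors();       // array of LibXMLError objects
libxml_clear_errors();                 // free the collected messages
libxml_use_internal_errors($previous); // restore the earlier behaviour

// The document is still usable even though the markup was not clean.
$text = $dom->getElementsByTagName('p')->item(0)->nodeValue;
echo $text, "\n";
echo count($problems), " parser message(s) collected\n";
```

The same three calls can be wrapped around loadHTMLFile(); other PHP warnings on the page stay visible.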
Hey,
This is just what I was looking for; however, I am getting a lot of error messages when I try to use this class to get information from IMDB.
Ex:
Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: Tag layer invalid in http://www.imdb.com/find?q=Taken, line: 113 in /Applications/XAMPP/xamppfiles/htdocs/film_system/etc/imdb.class.php on line 241
Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: htmlParseEntityRef: no name in http://www.imdb.com/find?q=Taken, line: 191 in /Applications/XAMPP/xamppfiles/htdocs/film_system/etc/imdb.class.php on line 241
They come from the two lines in your class.imdb.php document that contain loadHTMLFile(), ex:
public function getImdbURL($url) {
    $queryStr = '//p[@style]/b/a';
    $validTitleStr = "//head/link[@rel='canonical']";
    $searchURL = 'http://www.imdb.com/find?q='.urlencode($url);
    $searchDom = new DomDocument();
    $searchLoad = $searchDom->loadHTMLFile($searchURL); // ********* the warnings point at this line *********
    $xpath = new DomXPath($searchDom);
    // check to see if Imdb has directly redirected to the movie page. It happens for some movies (Ex: Shutter Island as someone pointed out).
    $linkHrefURL = $xpath->query($validTitleStr)->item(0)->getAttribute('href');
    if($this->isValidURL($linkHrefURL) ) {
        return $linkHrefURL;
    }
    if($xpath->query($queryStr)->length > 0 ) {
        return 'http://www.imdb.com'.$xpath->query($queryStr)->item(0)->getAttribute('href');
    } else {
        return false;
    }
}
}
Any ideas on what I am doing wrong?
I have tested your script on two separate servers, both running PHP 5.2.14 with all aspects of DOM enabled, on one it loads everything, on the other Shutter Island doesn't load, instead it gives a Warning about DOMDocument being empty after using loadHTML and a Fatal Exception when attempting to use getAttribute on a non DOMDocument object. Do you know what could be causing the problem? Thanks.
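The fatal error described here is what PHP raises when item(0) returns null (nothing matched, or the page came back empty) and getAttribute() is called on it anyway. Why one server receives an empty page is not something I can tell from the report, but the crash itself can be avoided with a guard. A sketch, using a hypothetical canonicalUrl() helper that works on an HTML string:

```php
<?php
// Hypothetical helper: returns the href of the canonical <link>, or null
// when the document is empty or has no such element.
function canonicalUrl(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing was downloaded
    }

    // Keep parser warnings quiet while loading untidy markup.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $node  = $xpath->query("//head/link[@rel='canonical']")->item(0);

    // item(0) is null when nothing matched, so check before getAttribute().
    return $node instanceof DOMElement ? $node->getAttribute('href') : null;
}
```

The same null check could be applied inside getImdbURL before each getAttribute('href') call.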