5,000-Movie Dataset from TMDB

2023-10-05 12:09 | Source: compiled from the web

Original text:

TMDB 5000 Movie Dataset

Metadata on ~5,000 movies from TMDb

What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?

This is a great place to start digging into those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.

We have removed the original version of this dataset per a DMCA takedown request from IMDB. In order to minimize the impact, we're replacing it with a similar set of films and data fields from The Movie Database (TMDb) in accordance with their terms of use. The bad news is that kernels built on the old dataset will most likely no longer work.

The good news is that:

You can port your existing kernels over with a bit of editing. A dedicated kernel offers functions and examples for doing so, and a general introduction to the new format is also available.

The new dataset contains full credits for both the cast and the crew, rather than just the first three actors.

Actors and actresses are now listed in the order they appear in the credits. It's unclear what ordering the original dataset used; for the movies I spot-checked, it didn't line up with either the credits order or IMDB's stars order.

The revenues appear to be more current. For example, IMDB's figures for Avatar seem to be from 2010 and understate the film's global revenues by over $2 billion.

Some of the movies that we weren't able to port over (a couple of hundred) were just bad entries. For example, one IMDB entry has basically no accurate information at all: it lists Star Wars Episode VII as a documentary.
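Because the cast is now stored in credits order, the "first three actors" that the old dataset exposed can be approximated from the full credits. The following is a minimal sketch, assuming (this is not stated in the description) that each cast entry is a JSON object with a `name` field and an integer `order` field where lower values are billed first. The helper `top_billed` and the sample data are hypothetical and only illustrate the idea.

```python
import json

def top_billed(cast_json, n=3):
    """Return the names of the first n cast members in credits order.

    Assumes cast_json is a JSON array of objects with "name" and "order"
    keys. That layout is an assumption, not something the text confirms.
    """
    cast = json.loads(cast_json)
    # Sort explicitly rather than trusting the order stored in the file.
    ranked = sorted(cast, key=lambda member: member["order"])
    return [member["name"] for member in ranked[:n]]

# Made-up sample; these are not real rows from the dataset.
sample_cast = json.dumps([
    {"name": "Actor C", "order": 2},
    {"name": "Actor A", "order": 0},
    {"name": "Actor D", "order": 3},
    {"name": "Actor B", "order": 1},
])

print(top_billed(sample_cast))
```

Sorting on `order` instead of slicing the list directly keeps the result stable even if a file stores the entries out of sequence.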

Data Source Transfer Details

Several of the new columns contain JSON. You can save a bit of time by porting the data-loading functions from an existing kernel.

Even simple fields like runtime may not be consistent across versions. For example, the previous dataset shows the duration of Avatar's extended cut, while TMDB shows the running time of the original version.

There's now a separate file containing the full credits for both the cast and crew.

All fields are filled out by users so don't expect them to agree on keywords, genres, ratings, or the like.

Your existing kernels will continue to render normally until they are re-run.

If you are curious about how this dataset was prepared, the code used to access TMDb's API has been posted as well.
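The two points about JSON columns and the separate credits file can be sketched together. This is a rough illustration using only the standard library, not the loading code from any existing kernel. It assumes that JSON cells hold arrays of objects, that the credits file has a `movie_id` column matching the movies file's `id`, and that `cast` and `crew` are the JSON columns in the credits file. The in-memory sample files stand in for the real CSVs, and all helper names are hypothetical.

```python
import csv
import io
import json

def load_rows(handle, json_columns):
    """Read CSV rows and decode the named columns from JSON strings."""
    rows = []
    for row in csv.DictReader(handle):
        for column in json_columns:
            text = row.get(column)
            # Treat a missing or empty cell as an empty list.
            row[column] = json.loads(text) if text else []
        rows.append(row)
    return rows

def attach_credits(movies, credits, movie_key="id", credit_key="movie_id"):
    """Add each movie's cast and crew by matching ids across the two files.

    The key names are assumptions about the file layout.
    """
    by_id = {credit[credit_key]: credit for credit in credits}
    for movie in movies:
        credit = by_id.get(movie[movie_key], {})
        movie["cast"] = credit.get("cast", [])
        movie["crew"] = credit.get("crew", [])
    return movies

def sample_csv(header, records):
    """Build an in-memory CSV so the sketch runs without the real files."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(records)
    buffer.seek(0)
    return buffer

# Made-up rows; not taken from the dataset.
movies_file = sample_csv(
    ["id", "original_title", "production_companies"],
    [["1", "Example Film", json.dumps([{"name": "Example Studio"}])],
     ["2", "Another Film", ""]],
)
credits_file = sample_csv(
    ["movie_id", "cast", "crew"],
    [["1", json.dumps([{"name": "Actor A", "order": 0}]),
      json.dumps([{"name": "Person B", "job": "Director"}])]],
)

movies = load_rows(movies_file, ["production_companies"])
credits = load_rows(credits_file, ["cast", "crew"])
movies = attach_credits(movies, credits)

print(movies[0]["production_companies"][0]["name"])
print(movies[0]["cast"][0]["name"])
print(movies[1]["cast"])
```

With the real files, the `sample_csv` calls would be replaced by `open(...)` on the two CSVs, and the list of JSON columns would need to be checked against the actual headers.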

New columns:

homepage

id

original_title

overview

popularity

production_companies

production_countries

release_date

spoken_languages

status

tagline

vote_average

Lost columns:

actor_1_facebook_likes

actor_2_facebook_likes

actor_3_facebook_likes

aspect_ratio

cast_total_facebook_likes

color

content_rating

director_facebook_likes

facenumber_in_poster

movie_facebook_likes

movie_imdb_link

num_critic_for_reviews

num_user_for_reviews
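When porting an old kernel, a quick first step is to check which of the columns it uses no longer exist. Below is a small sketch of that check. The column names are written with underscores (for example `actor_1_facebook_likes`), on the assumption that this is how the old dataset spelled them. The helper `find_lost_columns` and the example column list are hypothetical.

```python
# Columns listed as lost in the move to TMDb data (underscore spelling assumed).
LOST_COLUMNS = frozenset([
    "actor_1_facebook_likes",
    "actor_2_facebook_likes",
    "actor_3_facebook_likes",
    "aspect_ratio",
    "cast_total_facebook_likes",
    "color",
    "content_rating",
    "director_facebook_likes",
    "facenumber_in_poster",
    "movie_facebook_likes",
    "movie_imdb_link",
    "num_critic_for_reviews",
    "num_user_for_reviews",
])

def find_lost_columns(columns_used):
    """Return, sorted, the columns a kernel uses that the new data lacks."""
    return sorted(set(columns_used) & LOST_COLUMNS)

# Hypothetical list of columns referenced by an old kernel.
kernel_columns = ["budget", "color", "movie_facebook_likes", "vote_average"]

print(find_lost_columns(kernel_columns))
```

Any name the function returns marks code that has to be rewritten or dropped, since the new dataset has no replacement column under that name.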



