Amazon review data

Julian McAuley, UCSD

New!: See our updated (2018) version of the Amazon data here

See a variety of other datasets for recommender systems research on our lab’s dataset webpage

Description

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 – July 2014.

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

Files

“Small” subsets for experimentation

If you’re using this data for a class project (or similar), please consider using one of the smaller datasets below before requesting the larger files. To obtain the larger files you will need to contact me for access.

K-cores (i.e., dense subsets): These data have been reduced to extract the k-core, such that each of the remaining users and items has at least k reviews.

Ratings only: These datasets include no metadata or reviews, but only (user,item,rating,timestamp) tuples. Thus they’re suitable for use with mymedialite (or similar) packages.
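As an illustration of the k-core definition above, here is a minimal sketch that filters a list of (user,item,rating,timestamp) tuples. It is not the script used to build the published files; the function name `k_core` and the toy data are made up for this example.

```python
from collections import Counter

def k_core(ratings, k=5):
    """Repeatedly drop users and items with fewer than k ratings.

    `ratings` is a list of (user, item, rating, timestamp) tuples.
    Removing one user can push an item below k (and vice versa),
    so the filter is repeated until nothing changes.
    """
    ratings = list(ratings)
    while True:
        user_counts = Counter(r[0] for r in ratings)
        item_counts = Counter(r[1] for r in ratings)
        kept = [r for r in ratings
                if user_counts[r[0]] >= k and item_counts[r[1]] >= k]
        if len(kept) == len(ratings):  # fixed point reached
            return kept
        ratings = kept

# Toy data, invented for illustration only
toy = [
    ("u1", "i1", 5.0, 0), ("u1", "i2", 4.0, 1),
    ("u2", "i1", 3.0, 2), ("u2", "i2", 2.0, 3),
    ("u3", "i1", 1.0, 4),
]
core = k_core(toy, k=2)  # u3 has a single rating and is dropped
```

Each pass recounts from scratch, which is simple but slow on the full data; a queue-based version would be the usual choice at scale.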

Books 5-core (8,898,041 reviews) ratings only (22,507,155 ratings)
Electronics 5-core (1,689,188 reviews) ratings only (7,824,482 ratings)
Movies and TV 5-core (1,697,533 reviews) ratings only (4,607,047 ratings)
CDs and Vinyl 5-core (1,097,592 reviews) ratings only (3,749,004 ratings)
Clothing, Shoes and Jewelry 5-core (278,677 reviews) ratings only (5,748,920 ratings)
Home and Kitchen 5-core (551,682 reviews) ratings only (4,253,926 ratings)
Kindle Store 5-core (982,619 reviews) ratings only (3,205,467 ratings)
Sports and Outdoors 5-core (296,337 reviews) ratings only (3,268,695 ratings)
Cell Phones and Accessories 5-core (194,439 reviews) ratings only (3,447,249 ratings)
Health and Personal Care 5-core (346,355 reviews) ratings only (2,982,326 ratings)
Toys and Games 5-core (167,597 reviews) ratings only (2,252,771 ratings)
Video Games 5-core (231,780 reviews) ratings only (1,324,753 ratings)
Tools and Home Improvement 5-core (134,476 reviews) ratings only (1,926,047 ratings)
Beauty 5-core (198,502 reviews) ratings only (2,023,070 ratings)
Apps for Android 5-core (752,937 reviews) ratings only (2,638,172 ratings)
Office Products 5-core (53,258 reviews) ratings only (1,243,186 ratings)
Pet Supplies 5-core (157,836 reviews) ratings only (1,235,316 ratings)
Automotive 5-core (20,473 reviews) ratings only (1,373,768 ratings)
Grocery and Gourmet Food 5-core (151,254 reviews) ratings only (1,297,156 ratings)
Patio, Lawn and Garden 5-core (13,272 reviews) ratings only (993,490 ratings)
Baby 5-core (160,792 reviews) ratings only (915,446 ratings)
Digital Music 5-core (64,706 reviews) ratings only (836,006 ratings)
Musical Instruments 5-core (10,261 reviews) ratings only (500,176 ratings)
Amazon Instant Video 5-core (37,126 reviews) ratings only (583,933 ratings)

Complete review data

Please see the per-category files below, and only download these (large!) files if you really need them:

raw review data (20gb) – all 142.8 million reviews

The above file contains some duplicate reviews, mainly due to near-identical products whose reviews Amazon merges, e.g. VHS and DVD versions of the same movie. These duplicates have been removed in the files below:

user review data (18gb) – duplicate items removed (83.68 million reviews), sorted by user

product review data (18gb) – duplicate items removed, sorted by product

ratings only (3.2gb) – same as above, in csv form without reviews or metadata

5-core (9.9gb) – subset of the data in which all users and items have at least 5 reviews (41.13 million reviews)

Finally, the following file removes duplicates more aggressively, removing duplicates even if they are written by different users. This accounts for users with multiple accounts or plagiarized reviews. Such duplicates account for less than 1 percent of reviews, though this dataset is probably preferable for sentiment analysis type tasks:

aggressively deduplicated data (18gb) – no duplicates whatsoever (82.83 million reviews)
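The exact criterion used to build the aggressively deduplicated file is not stated here. The sketch below shows one plausible approach under a loud assumption: two reviews count as duplicates when their `reviewText` matches after lowercasing and collapsing whitespace, regardless of `reviewerID`. The function name `dedupe_by_text` and the toy reviews are hypothetical.

```python
import hashlib

def dedupe_by_text(reviews):
    """Yield the first review for each distinct (normalized) review text.

    Assumption: duplicates are detected on reviewText only, ignoring
    who wrote the review. Reviews with empty text are always kept.
    """
    seen = set()
    for review in reviews:
        text = " ".join(review.get("reviewText", "").lower().split())
        if not text:
            yield review
            continue
        # Store a fixed-size digest instead of the full text to save memory
        digest = hashlib.sha1(text.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield review

# Toy reviews, invented for illustration only
toy_reviews = [
    {"reviewerID": "A1", "reviewText": "Great purchase though!"},
    {"reviewerID": "A2", "reviewText": "great  purchase though!"},
    {"reviewerID": "A3", "reviewText": "Hard to read."},
]
unique = list(dedupe_by_text(toy_reviews))
```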

Format is one-review-per-line in (loose) json. See examples below for further help reading the data.

Sample review:

{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}

where

  • reviewerID – ID of the reviewer, e.g. A2SUAM1J3GNN3B
  • asin – ID of the product, e.g. 0000013714
  • reviewerName – name of the reviewer
  • helpful – helpfulness rating of the review, e.g. 2/3
  • reviewText – text of the review
  • overall – rating of the product
  • summary – summary of the review
  • unixReviewTime – time of the review (unix time)
  • reviewTime – time of the review (raw)
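The fields listed above can be turned into more convenient values with a few lines of Python. This is a small sketch, not part of the dataset's own tooling: the helper names `helpful_ratio` and `review_datetime` are made up, and `helpful` is assumed to hold [helpful votes, total votes].

```python
from datetime import datetime, timezone

def helpful_ratio(review):
    """Fraction of votes that found the review helpful, or None if no votes.

    Assumes review["helpful"] is [helpful_votes, total_votes].
    """
    up, total = review["helpful"]
    return up / total if total else None

def review_datetime(review):
    """Convert unixReviewTime (seconds since the epoch) to a UTC datetime."""
    return datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc)

# Values taken from the sample review above
sample = {
    "helpful": [2, 3],
    "unixReviewTime": 1252800000,
    "reviewTime": "09 13, 2009",
}
ratio = helpful_ratio(sample)
when = review_datetime(sample)
```

For the sample review, the UTC date derived from `unixReviewTime` agrees with the raw `reviewTime` string.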

Metadata

Metadata includes descriptions, price, sales-rank, brand info, and co-purchasing links:

metadata (3.1gb) – metadata for 9.4 million products

Sample metadata:

{
  "asin": "0000031852",
  "title": "Girls Ballet Tutu Zebra Hot Pink",
  "price": 3.17,
  "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg",
  "related": {
    "also_bought": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", "B00613WDTQ", "B00D0WDS9A", "B00D0GCI8S", "0000031895", "B003AVKOP2", "B003AVEU6G", "B003IEDM9Q", "B002R0FA24", "B00D23MC6W", "B00D2K0PA0", "B00538F5OK", "B00CEV86I6", "B002R0FABA", "B00D10CLVW", "B003AVNY6I", "B002GZGI4E", "B001T9NUFS", "B002R0F7FE", "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"],
    "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", "B00AFDOPDA", "B00E1YRI4C", "B002GZGI4E", "B003AVKOP2", "B00D9C1WBM", "B00CEV8366", "B00CEUX0D8", "B0079ME3KU", "B00CEUWY8K", "B004FOEEHC", "0000031895", "B00BC4GY9Y", "B003XRKA7A", "B00K18LKX2", "B00EM7KAG6", "B00AMQ17JA", "B00D9C32NI", "B002C3Y6WG", "B00JLL4L5Y", "B003AVNY6I", "B008UBQZKU", "B00D0WDS9A", "B00613WDTQ", "B00538F5OK", "B005C4Y4F6", "B004LHZ1NY", "B00CPHX76U", "B00CEUWUZC", "B00IJVASUE", "B00GOR07RE", "B00J2GTM0W", "B00JHNSNSM", "B003IEDM9Q", "B00CYBU84G", "B008VV8NSQ", "B00CYBULSO", "B00I2UHSZA", "B005F50FXC", "B007LCQI3S", "B00DP68AVW", "B009RXWNSI", "B003AVEU6G", "B00HSOJB9M", "B00EHAGZNA", "B0046W9T8C", "B00E79VW6Q", "B00D10CLVW", "B00B0AVO54", "B00E95LC8Q", "B00GOR92SO", "B007ZN5Y56", "B00AL2569W", "B00B608000", "B008F0SMUC", "B00BFXLZ8M"],
    "bought_together": ["B002BZX8Z6"]
  },
  "salesRank": {"Toys & Games": 211836},
  "brand": "Coxlures",
  "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
}

where

  • asin – ID of the product, e.g. 0000031852
  • title – name of the product
  • price – price in US dollars (at time of crawl)
  • imUrl – url of the product image
  • related – related products (also bought, also viewed, bought together, buy after viewing)
  • salesRank – sales rank information
  • brand – brand name
  • categories – list of categories the product belongs to
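The `related` field described above can be read as a product graph. The following is a minimal sketch that turns one metadata record into directed edges; the function name `related_edges` is hypothetical, and the record is a shortened copy of the sample metadata. Records without a `related` field are assumed possible, so missing keys yield no edges.

```python
def related_edges(product, relation="also_bought"):
    """Yield (source_asin, target_asin) pairs for one relation type."""
    source = product["asin"]
    # Fall back to empty containers when the field or relation is absent
    for target in product.get("related", {}).get(relation, []):
        yield (source, target)

# Shortened copy of the sample metadata record
sample = {
    "asin": "0000031852",
    "related": {
        "also_bought": ["B00JHONN1S", "B002BZX8Z6"],
        "bought_together": ["B002BZX8Z6"],
    },
}
edges = list(related_edges(sample))
together = list(related_edges(sample, relation="bought_together"))
```

Collecting these pairs over all products gives an edge list that graph libraries can load directly.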

Visual Features

We extracted visual features from each product image using a deep CNN (see citation below). Image features are stored in a binary format, which consists of 10 characters (the product ID), followed by 4096 floats (repeated for every product). See files below for further help reading the data.

visual features (141gb) – visual features for all products
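Because every record in the binary layout described above has the same size, a single product can be read by seeking straight to it. The sketch below assumes 4-byte floats in native byte order (the same assumption the `array` module's 'f' type code makes on common platforms); the names `RECORD_BYTES` and `read_record` and the in-memory demo data are made up for illustration.

```python
import array
import io

ASIN_BYTES = 10                           # product ID, 10 characters
N_FLOATS = 4096                           # feature vector length
FLOAT_BYTES = array.array('f').itemsize   # 4 on common platforms
RECORD_BYTES = ASIN_BYTES + N_FLOATS * FLOAT_BYTES

def read_record(f, index):
    """Return (asin, features) for the record at `index`, or None past EOF."""
    f.seek(index * RECORD_BYTES)
    asin = f.read(ASIN_BYTES)
    if len(asin) < ASIN_BYTES:
        return None
    features = array.array('f')
    features.frombytes(f.read(N_FLOATS * FLOAT_BYTES))
    return asin.decode("ascii"), features.tolist()

# Tiny in-memory "file" with two invented records, for demonstration
buf = io.BytesIO()
for demo_asin, value in [(b"0000031852", 0.5), (b"B002BZX8Z6", 1.5)]:
    buf.write(demo_asin)
    buf.write(array.array('f', [value] * N_FLOATS).tobytes())

asin, feats = read_record(buf, 1)
```

With 4-byte floats each record occupies 10 + 4096 × 4 = 16,394 bytes.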

The images themselves can be extracted from the imUrl field in the metadata files.

Per-category files

Below are files for individual product categories, which have already had duplicate item reviews removed.

Books reviews (22,507,155 reviews) metadata (2,370,585 products) image features
Electronics reviews (7,824,482 reviews) metadata (498,196 products) image features
Movies and TV reviews (4,607,047 reviews) metadata (208,321 products) image features
CDs and Vinyl reviews (3,749,004 reviews) metadata (492,799 products) image features
Clothing, Shoes and Jewelry reviews (5,748,920 reviews) metadata (1,503,384 products) image features
Home and Kitchen reviews (4,253,926 reviews) metadata (436,988 products) image features
Kindle Store reviews (3,205,467 reviews) metadata (434,702 products) image features
Sports and Outdoors reviews (3,268,695 reviews) metadata (532,197 products) image features
Cell Phones and Accessories reviews (3,447,249 reviews) metadata (346,793 products) image features
Health and Personal Care reviews (2,982,326 reviews) metadata (263,032 products) image features
Toys and Games reviews (2,252,771 reviews) metadata (336,072 products) image features
Video Games reviews (1,324,753 reviews) metadata (50,953 products) image features
Tools and Home Improvement reviews (1,926,047 reviews) metadata (269,120 products) image features
Beauty reviews (2,023,070 reviews) metadata (259,204 products) image features
Apps for Android reviews (2,638,173 reviews) metadata (61,551 products) image features
Office Products reviews (1,243,186 reviews) metadata (134,838 products) image features
Pet Supplies reviews (1,235,316 reviews) metadata (110,707 products) image features
Automotive reviews (1,373,768 reviews) metadata (331,090 products) image features
Grocery and Gourmet Food reviews (1,297,156 reviews) metadata (171,760 products) image features
Patio, Lawn and Garden reviews (993,490 reviews) metadata (109,094 products) image features
Baby reviews (915,446 reviews) metadata (71,317 products) image features
Digital Music reviews (836,006 reviews) metadata (279,899 products) image features
Musical Instruments reviews (500,176 reviews) metadata (84,901 products) image features
Amazon Instant Video reviews (583,933 reviews) metadata (30,648 products) image features

Citation

Please cite one or both of the following if you use the data in any way:

Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
R. He, J. McAuley
WWW, 2016
pdf

Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015
pdf

Code

Reading the data

Data can be treated as python dictionary objects. A simple script to read any of the above data is as follows:

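import gzip

def parse(path):
    # Each line is a python dict literal; eval() executes code,
    # so only use this on files you trust.
    g = gzip.open(path, 'r')
    for l in g:
        yield eval(l)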

Convert to ‘strict’ json

The above data can be read with python ‘eval’, but is not strict json. If you’d like to use some language other than python, you can convert the data to strict json as follows:

import json
import gzip

def parse(path):
    g = gzip.open(path, 'r')
    for l in g:
        # Re-serialize each python dict literal as strict json
        yield json.dumps(eval(l))

f = open("output.strict", 'w')
for l in parse("reviews_Video_Games.json.gz"):
    f.write(l + '\n')
f.close()

Pandas data frame

This code reads the data into a pandas data frame:

import pandas as pd
import gzip

def parse(path):
    g = gzip.open(path, 'rb')
    for l in g:
        yield eval(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

df = getDF('reviews_Video_Games.json.gz')

Read image features

import array

def readImageFeatures(path):
    f = open(path, 'rb')
    while True:
        asin = f.read(10)
        if not asin:  # end of file: read() returned nothing
            break
        a = array.array('f')
        a.fromfile(f, 4096)
        yield asin, a.tolist()

Example: compute average rating

ratings = []

for review in parse("reviews_Video_Games.json.gz"):
    ratings.append(review['overall'])

print(sum(ratings) / len(ratings))

Example: latent-factor model in mymedialite

Predicts ratings from a rating-only CSV file

./rating_prediction --recommender=BiasedMatrixFactorization --training-file=ratings_Video_Games.csv --test-ratio=0.1
