We will be working with a dataset of used car listings from eBay Kleinanzeigen, the classifieds section of the German eBay site. The dataset was originally scraped and uploaded to Kaggle; the version we use here includes a few modifications from that original.
The data dictionary provided with the data is as follows:
The aim of this project is to clean the data and analyze the included used car listings.
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding="Latin-1")
autos.info()
autos.head()
Our dataset contains 20 columns, most of which are stored as strings. There are a few columns with null values, but no columns have more than ~20% null values. There are some columns that contain dates stored as strings.
We'll have to clean the column names to make the data easier to work with.
# let's have a detailed look at the column names
autos.columns
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'num_pictures', 'postal_code',
'last_seen']
We've made a few changes here, converting the column names to snake_case and renaming some columns to be more descriptive.
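For reference, the same kind of conversion could also be done programmatically rather than with a hand-written list. The sketch below assumes the original names are camelCase (as in the Kaggle dataset) and uses a hypothetical to_snake_case helper; the more descriptive renames above would still need to be applied by hand.
import re

def to_snake_case(name):
    # insert an underscore before each capital letter, then lowercase everything
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# e.g. autos.columns = [to_snake_case(c) for c in autos.columns]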
# let's check the top 5 rows
autos.head()
Next, we'll explore the data to narrow down what we need to do to clean it.
autos.describe(include='all')
From the output above, we can spot a few columns that look like candidates for dropping.
However, before we do that, let's examine them to be sure of what they contain.
autos["seller"].describe()
autos["offer_type"].describe()
autos["num_pictures"].describe()
Now we have the data to back up that these columns don't add any value to our analysis, so we'll drop them from the dataset.
autos = autos.drop(["num_pictures", "seller", "offer_type"], axis=1)
On a closer look, we should also explore the following columns:
autos["ab_test"].describe()
autos["gearbox"].value_counts()
autos["unrepaired_damage"].value_counts()
We can conclude that these columns hold key pieces of information, so we'll keep them.
We also suspect that the price and odometer column values are stored as text, so let's confirm whether that's the case.
autos["price"].describe()
autos["odometer"].describe()
Our suspicion was spot on: both columns are stored as text. We'll remove the non-numeric characters and convert each column to a numeric type. While doing so, we'll also rename the odometer column to odometer_km so that the units are explicit.
autos["price"] = (autos["price"]
.str.replace("$","")
.str.replace(",","")
.astype(int)
)
autos["price"].head()
autos["odometer"] = (autos["odometer"]
.str.replace("km","")
.str.replace(",","")
.astype(int)
)
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
autos["odometer_km"].head()
autos["odometer_km"].value_counts()
The values appear to be rounded to a small set of preset figures (the site likely asked sellers to choose from predefined mileage ranges), and most of the cars have a high odometer reading.
print(autos["odometer_km"].describe())
print(autos["odometer_km"].unique().shape)
As we can see, there are 13 unique values for odometer_km.
print(autos["price"].describe())
print(autos["price"].value_counts())
print(autos["price"].unique().shape)
There is a lot going on here, so let's have a closer look at the highest and lowest prices.
autos["price"].value_counts().sort_index(ascending=False).head(20)
autos["price"].value_counts().sort_index(ascending=True).head(20)
autos["price"].value_counts().between(0,3260100)
So we'll keep prices from $1 to $3,260,100: listings priced at $0 are unlikely to be genuine sales, and $3,260,100 was the most expensive used car ever sold on eBay (ref: https://www.ebay.com/motors/blog/most-expensive-cars-sold-ebay/).
autos = autos[autos["price"].between(1,3260100)]
autos["price"].describe()
There are five columns that should represent date values. Some of them were created by the crawler and some come from the website itself.
Right now, the date_crawled, last_seen, and ad_created columns are all stored as strings by pandas, while registration_month and registration_year are already numeric.
Let's explore them and see what we discover.
autos[['date_crawled','ad_created','last_seen']][0:5]
(autos["date_crawled"]
.str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index()
)
(autos["date_crawled"]
.str[:10]
.value_counts(normalize=True, dropna=False)
.sort_values()
)
We can see that the site was crawled during March and April 2016, with the crawls spread roughly evenly across that period.
(autos["last_seen"]
.str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index()
)
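Although we'll keep working with string slices of these dates for this project, the three columns could also be parsed into proper datetime values to make date arithmetic easier. A minimal sketch (the *_dt column names are just illustrative placeholders):
# optional: parse the crawler/website date strings into pandas datetimes
autos["date_crawled_dt"] = pd.to_datetime(autos["date_crawled"])
autos["ad_created_dt"] = pd.to_datetime(autos["ad_created"])
autos["last_seen_dt"] = pd.to_datetime(autos["last_seen"])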
autos["registration_year"].describe()
We can immediately see that some values aren't possible: the minimum is 1000 and the maximum is 9999. A car can't have been registered in the year 1000, and it can't have been first registered after the listing was crawled, so years beyond 2016 don't make sense either. We'll keep registration years between 1990 and 2016, focusing the analysis on relatively recent cars.
autos = autos[autos["registration_year"].between(1990,2016)]
autos["registration_year"].value_counts(normalize=True)
autos["brand"].value_counts(normalize=True)
We can see that roughly half of the listings are for cars produced by German manufacturers. For the brand-level analysis, we'll focus on brands that account for more than 5% of the listings.
brand_counts = autos["brand"].value_counts(normalize=True)
common_brands = brand_counts[brand_counts > .05].index
print(common_brands)
brand_mean_prices = {}
for brand in common_brands:
    brand_only = autos[autos["brand"] == brand]
    mean_price = brand_only["price"].mean()
    brand_mean_prices[brand] = int(mean_price)
brand_mean_prices
We can see why Volkswagen is so popular: it falls in the mid-range of prices. Audi, BMW, and Mercedes-Benz are the higher-priced brands, while Opel and Ford sit at the lower end of the price range.
bmp_series = pd.Series(brand_mean_prices)
pd.DataFrame(bmp_series, columns=["mean_price"])
brand_mean_mileage = {}
for brand in common_brands:
    brand_only = autos[autos["brand"] == brand]
    mean_mileage = brand_only["odometer_km"].mean()
    brand_mean_mileage[brand] = int(mean_mileage)
mean_mileage = pd.Series(brand_mean_mileage).sort_values(ascending=False)
mean_prices = pd.Series(brand_mean_prices).sort_values(ascending=False)
brand_info = pd.DataFrame(mean_mileage, columns=['mean_mileage'])
brand_info
brand_info["mean_price"] = mean_prices
brand_info
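As a cross-check on the loop-based approach above, the same summary table can be built in a single step with groupby/agg (a sketch; brand_info_check is just an illustrative name, and the values come out as floats rather than the truncated integers above):
# cross-check: build the same brand summary with one groupby/agg call
brand_info_check = (autos[autos["brand"].isin(common_brands)]
                    .groupby("brand")
                    .agg(mean_mileage=("odometer_km", "mean"),
                         mean_price=("price", "mean"))
                    .sort_values("mean_mileage", ascending=False))
brand_info_check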