3. streamlined data ingestion with pandas
// LESSON 1.1 - INTRODUCTION TO FLAT FILES
data frames
- pandas-specific structure for two-dimensional data
flat files
- simple, easy to produce format
- data stored as plain text (no formatting)
- one row per line
- values for different fields are separated by a delimiter
- most common flat file type: comma separated values
loading csv
import pandas as pd
tax_data = pd.read_csv("us_tax_data.csv")
tax_data.head(4)
loading other flat files
# Import pandas with the alias pd
import pandas as pd
# Load TSV using the sep keyword argument to set delimiter
tax_data = pd.read_csv("us_tax_data.tsv", sep="\t")
tax_data.head(4)
# Plot the total number of tax returns by income group
import matplotlib.pyplot as plt
counts = tax_data.groupby("agi_stub").N1.sum()
counts.plot.bar()
plt.show()
// LESSON 1.2 - MODIFYING FLAT FILE IMPORTS
limiting columns
- choose columns to load with the usecols keyword argument
- accepts a list of column numbers or names, or a callable to filter column names (see the callable sketch after this example)
col_names = ['STATEFIPS', 'STATE', 'zipcode', 'agi_stub', 'N1']
col_nums = [0, 1, 2, 3, 4]  # positional indexes of the same columns
# Choose columns to load by name
tax_data_v1 = pd.read_csv('us_tax_data_2016.csv',
                          usecols=col_names)  # load only the columns whose names are in col_names
# Choose columns to load by number
tax_data_v2 = pd.read_csv('us_tax_data_2016.csv',
                          usecols=col_nums)   # load only the first five columns (positions 0-4)
print(tax_data_v1.equals(tax_data_v2))  # prints True: both calls select the same columns
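a minimal sketch of the callable form of usecols, assuming the same us_tax_data_2016.csv file; the function is called with each column name and returns True for columns to keep (the filter itself is just an illustration)
import pandas as pd

# Keep only columns whose names start with "N" (hypothetical filter)
keep_n_cols = lambda name: name.startswith("N")

n_cols = pd.read_csv("us_tax_data_2016.csv", usecols=keep_n_cols)
print(n_cols.columns)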
limiting rows
- limit the number of rows loaded with the nrows keyword argument
tax_data_first1000 = pd.read_csv('us_tax_data_2016.csv', nrows=1000)
print(tax_data_first1000.shape)
- use nrows and skiprows together to process a file in chunks
- skiprows accepts a list of row numbers, a number of rows, or a function to filter rows
- set header=None so pandas knows there are no column names
tax_data_next500 = pd.read_csv("us_tax_data.csv",
                               nrows=500,      # read only 500 rows
                               skiprows=1000,  # skip the first 1000 rows
                               header=None)    # no column names; without this the default (header=0) treats the first row read as column names
assigning column names
- supply column names by passing a list to the names argument
- the list must have a name for every column in your data
- if you only need to rename a few columns, do it after the import
col_names = list(tax_data_first1000)  # take the column names from the first chunk as a list
tax_data_next500 = pd.read_csv("us_tax_data.csv",
                               nrows=500,
                               skiprows=1000,
                               header=None,      # no column names in the skipped-ahead chunk
                               names=col_names)  # assign the column names from the first chunk
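as an alternative to pairing nrows and skiprows by hand, read_csv also accepts a chunksize argument that returns an iterator of data frames; a minimal sketch, assuming the same us_tax_data.csv file
import pandas as pd

# read_csv with chunksize returns an iterator of DataFrames instead of one frame
for chunk in pd.read_csv("us_tax_data.csv", chunksize=1000):
    print(chunk.shape)  # each chunk has at most 1000 rows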
// LESSON 1.3 - HANDLING ERRORS AND MISSING DATA
common flat file import issues
- column data types are wrong
- pandas automatically infers column data types, and sometimes it gets them wrong
- do not let pandas guess, specify the column data types
specifying data types on a column
tax_data = pd.read_csv("us_tax_data_2016.csv", dtype={"zipcode":str})
- values are missing
- pandas automatically interprets some values as missing or NA
- sometimes missing values are represented in ways that pandas won't catch such as with dummy codes (e.g. zip codes as 0 when they should be NA)
customizing missing data values
tax_data = pd.read_csv("us_tax_data_2016.csv",
                       na_values={"zipcode": 0})  # any 0 in the zipcode column is treated as an NA value
print(tax_data[tax_data.zipcode.isna()])  # show rows where zipcode is missing
- records that pandas cannot parse (lines with errors)
solution
- set error_bad_lines=False to skip unparseable records
- set warn_bad_lines=True to see messages when records are skipped
try:
# Set warn_bad_lines to issue warnings about bad records
data = pd.read_csv("vt_tax_data_2016_corrupt.csv",
error_bad_lines=False,
warn_bad_lines=True)
# View first 5 records
print(data.head())
except pd.io.common.CParserError:
print("Your data contained rows that could not be parsed.")
// LESSON 2.1 - INTRODUCTION TO SPREADSHEETS
- also known as Excel files
- data stored in tabular form, with cells arranged in rows and columns
- unlike flat files, can have formatting and formulas
- multiple spreadsheets (sheets) can exist in one workbook
loading spreadsheets
- read_excel() function from pandas
import pandas as pd
# Read the Excel file
survey_data = pd.read_excel("fcc_survey.xlsx")
# View the first 5 lines of data
print(survey_data.head())
loading select columns and rows
- useful when spreadsheets contain non-tabular content such as metadata headers
- read_excel() has many keyword arguments in common with read_csv()
- nrows: limit number of rows to load
- skiprows: specify number of rows or row numbers to skip
- usecols: choose columns by name, positional number, or Excel letter (e.g. "A:P" for columns A through P)
# Read columns W-AB and AR of file, skipping metadata header
survey_data = pd.read_excel("fcc_survey_with_headers.xlsx",
                            skiprows=2,          # skip the first two rows, so data starts at row 3
                            usecols="W:AB, AR")  # columns W through AB, plus the single column AR
print(survey_data.head())
// LESSON 2.2 - GETTING DATA FROM MULTIPLE WORKSHEETS
- an Excel file is a workbook that can contain multiple sheets
- a spreadsheet (sheet) is not the same thing as the Excel file (workbook) that contains it
- read_excel() loads the first sheet in an Excel file by default
- use the sheet_name keyword argument to load other sheets
- specify sheets by name and/or (zero-indexed) position number
- pass a list of names/numbers to load more than one sheet at a time
- any other arguments passed to read_excel() apply to all sheets read
loading select sheets
# Get the second sheet by position index
survey_data_sheet2 = pd.read_excel('fcc_survey.xlsx',
sheet_name=1) # remember that index starts with 0
# Get the second sheet by name
survey_data_2017 = pd.read_excel('fcc_survey.xlsx',
sheet_name='2017')
print(survey_data_sheet2.equals(survey_data_2017))
# Graph where people would like to get a developer job
survey_prefs = survey_data_2017.groupby("JobPref").JobPref.count()
survey_prefs.plot.barh()
plt.show()
loading all sheets
- prefer separate read_excel() calls when sheets contain very different data or layouts; that way you can customize the other arguments for each sheet
# Passing sheet_name=None to read_excel() reads all sheets in a workbook
survey_responses = pd.read_excel("fcc_survey.xlsx", sheet_name=None)
print(type(survey_responses))
# Loading multiple sheets, you could also load it with its index position (e.g. sheet_name = [0,1] or sheet_name = [0, "2017"])
survey_responses = pd.read_excel("fcc_survey.xlsx", sheet_name=["2016","2017"])
# survey_responses is now a dict of DataFrames keyed by sheet name; loop through it
for key, value in survey_responses.items():
print(key, type(value))
putting it all together
# Create empty data frame to hold all loaded sheets
all_responses = pd.DataFrame()
# Iterate through data frames in dictionary
for sheet_name, frame in survey_responses.items():
# Add a column so we know which year data is from
frame["Year"] = sheet_name
# Add each data frame to all_responses
all_responses = all_responses.append(frame)
# View years in data
print(all_responses.Year.unique())
another example of putting it all together
# Create an empty data frame
all_responses = pd.DataFrame()
# Set up for loop to iterate through values in responses
for df in responses.values():
# Print the number of rows being added
print("Adding {} rows".format(df.shape[0]))
# Append df to all_responses, assign result
all_responses = all_responses.append(df)
# Graph employment statuses in sample
counts = all_responses.groupby("EmploymentStatus").EmploymentStatus.count()
counts.plot.barh()
plt.show()
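note: DataFrame.append() was deprecated and removed in pandas 2.0; the same combining step can be done with pd.concat(). a minimal sketch, assuming the survey_responses dict of data frames loaded above
import pandas as pd

frames = []
for sheet_name, frame in survey_responses.items():
    frame["Year"] = sheet_name  # tag each frame with the sheet it came from
    frames.append(frame)        # list.append, not DataFrame.append

# Concatenate all sheets into one DataFrame, renumbering the rows
all_responses = pd.concat(frames, ignore_index=True)
print(all_responses.Year.unique())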
// LESSON 2.3 - MODIFYING IMPORTS: TRUE/FALSE DATA
- True/False data may be encoded as True/False, 1/0, Yes/No, etc.
- pandas does not infer a Boolean dtype on its own; such columns are loaded as float64 by default
pandas and booleans
# Count True values
print(bootcamp_data.sum())  # NA values are excluded from the count
#Count NAs
print(bootcamp_data.isna().sum())
specifying datatype as bools
- in the default float64 load, 1.0/0.0 map to True/False and missing values stay NaN
- when cast to the bool dtype, missing values (NA) are coerced to True
- unrecognized values are also set to True in the bool dtype
# Load data with Boolean dtypes and custom T/F values
bool_data = pd.read_excel("fcc_survey_booleans.xlsx",
dtype={"AttendedBootcamp": bool,
"AttendedBootCampYesNo": bool,
"AttendedBootcampTF":bool,
"BootcampLoan": bool,
"LoanYesNo": bool,
"LoanTF": bool},
true_values=["Yes"], # specifying that yes values should be interpreted as true
false_values=["No"])
- NA values are still coerced to True; to avoid misrepresenting missing data, it may be better to keep such a column as floats
boolean considerations
- are there missing values, or could there be in the future?
- how will this column be used in analysis
- what would happen if a value were incorrectly coded as True?
- could the data be modeled another way (e.g. as floats or integers)?
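note: newer pandas versions offer a nullable "boolean" dtype that keeps NAs as missing instead of coercing them to True; a minimal sketch, assuming the AttendedBootCampYesNo column in the file above holds Yes/No strings
import pandas as pd

survey = pd.read_excel("fcc_survey_booleans.xlsx")
# Map Yes/No to True/False, then cast to the nullable Boolean dtype;
# anything unmapped (including blanks) becomes <NA> instead of True
survey["AttendedBootCampYesNo"] = (survey["AttendedBootCampYesNo"]
                                   .map({"Yes": True, "No": False})
                                   .astype("boolean"))
print(survey["AttendedBootCampYesNo"].isna().sum())  # missing values stay countable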
// LESSON 2.4 - MODIFYING IMPORTS: PARSING DATES
- dates and times have their own data type and internal representation
- datetime values can be translated into string representations
- common set of codes to describe datetime string formatting
pandas and datetimes
- datetime columns are loaded as objects (string) by default
- specify that columns have datetimes with the parse_dates argument (not dtype)
- parse_dates can accept:
- a list of column names or numbers to parse
- a list containing lists of columns to combine and parse
- a dictionary where keys are new column names and values are lists of columns to parse together
parsing dates
example legends
- Part1StartTime: complete date and time (e.g. 2016-03-29 21:23:13)
- Part1EndTime: complete date and time (e.g. 2016-03-29 21:45:18)
- Part2StartDate: only the date (e.g. 2016-03-29)
- Part2StartTime: only the time (e.g. 21:24:57)
- Part2EndTime: complete date and time but in a different format (e.g. 03292016 21:24:25)
# List columns of dates to parse
date_cols = ["Part1StartTime", "Part1EndTime"]
# Load file, parsing standard datetime columns
survey_df = pd.read_excel("fcc_survey.xlsx",
parse_dates=date_cols)
# Check data types of timestamp columns
print(survey_df[["Part1StartTime", # this would produce datetime64[ns] since we already parsed it
"Part1EndTime", # this would produce datetime64[ns] since we already parsed it
"Part2StartDate", # not yet datetime64 since we haven't parsed it yet
"Part2StartTime",
"Part2EndTime"]].dtypes)
# List columns of dates to parse, including a pair of columns to combine
date_cols = ["Part1StartTime", "Part1EndTime",
             ["Part2StartDate", "Part2StartTime"]]  # these two columns are combined and parsed into a single datetime column
# Load file, parsing standard and split datetime columns
survey_df = pd.read_excel("fcc_survey.xlsx",
parse_dates=date_cols)
another way of parsing dates
# Same as above, except the dictionary keys become the names of the new (combined) datetime columns
date_cols = {"Part1Start": "Part1StartTime",
"Part1End": "Part1EndTime",
"Part2Start": ["Part2StartDate",
"Part2StartTime"]}
# Load file, parsing standard and split datetime columns
survey_df = pd.read_excel("fcc_survey.xlsx",
parse_dates=date_cols)
non-standard dates
- parse_dates doesn't work with non-standard datetime formats
- use pd.to_datetime() after loading data if parse_dates won't work
- to_datetime() arguments
- data frame and column to convert
- format: string representation of datetime format
datetime formatting
- describe the layout with strftime codes, e.g. %Y year, %m month, %d day, %H hour, %M minute, %S second
parsing non-standard dates
format_string = "%m%d%Y %H:%M:%S"  # month, day, 4-digit year, then hour:minute:second
survey_df["Part2EndTime"] = pd.to_datetime(survey_df["Part2EndTime"],
format=format_string)
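if some values do not match the given format, pd.to_datetime() raises an error by default; passing errors="coerce" turns unparseable values into NaT instead. a minimal sketch, assuming the same survey_df
import pandas as pd

survey_df["Part2EndTime"] = pd.to_datetime(survey_df["Part2EndTime"],
                                           format="%m%d%Y %H:%M:%S",
                                           errors="coerce")  # unparseable values become NaT (missing datetime)
print(survey_df["Part2EndTime"].isna().sum())  # count values that failed to parse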
// LESSON 3.1 - INTRODUCTION TO DATABASES
relational databases
- tables can be linked to each other via unique keys
- support more data, multiple simultaneous users, and data quality controls
- data types are specified for each column
- use SQL (Structured Query Language) to interact with the database
connecting to databases
- create a way to connect to the database
- query what you want from the database
creating a database engine
- use sqlalchemy's create_engine() function; it makes an engine to handle database connections and takes a database URL string (e.g. "sqlite:///data.db")
querying database
- pd.read_sql(query, engine) to load in data from a database
# Load pandas and sqlalchemy's create_engine
import pandas as pd
from sqlalchemy import create_engine
# Create database engine to manage connections
engine = create_engine("sqlite:///data.db")
# Load entire weather table with SQL
weather = pd.read_sql("SELECT * FROM weather", engine)
print(weather.head())
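to see which tables a database contains before querying, sqlalchemy's inspect() can list table names; a minimal sketch, assuming the same data.db file
from sqlalchemy import create_engine, inspect

engine = create_engine("sqlite:///data.db")
# List the tables available in the database
print(inspect(engine).get_table_names())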
// LESSON 3.2 - REFINING IMPORTS WITH SQL QUERIES
sample use of selecting multiple columns
# Create database engine for data.db
engine = create_engine('sqlite:///data.db')
# Write query to get date, tmax, and tmin from weather
query = """
SELECT date,
tmax,
tmin
FROM weather;
"""
# Make a data frame by passing query and engine to read_sql()
temperatures = pd.read_sql(query, engine)
# View the resulting data frame
print(temperatures)
sample use of where clause
# Create query to get hpd311calls records about safety
query = """
SELECT *
FROM hpd311calls
WHERE complaint_type = 'SAFETY';
"""
# Query the database and assign result to safety_calls
safety_calls = pd.read_sql(query,engine)
# Graph the number of safety calls by borough
call_counts = safety_calls.groupby('borough').unique_key.count()
call_counts.plot.barh()
plt.show()
where clause with logical operator
# Create query for records with max temps <= 32 or snow >= 1
query = """
SELECT *
FROM weather
WHERE tmax <= 32
  OR snow >= 1;  -- AND works the same way
"""
# Query database and assign result to wintry_days
wintry_days = pd.read_sql(query, engine)
# View summary stats about the temperatures
print(wintry_days.describe())
// LESSON 3.3 - MORE COMPLEX SQL QUERIES
getting distinct values
- get unique values for one or more columns with SELECT DISTINCT
SYNTAX: SELECT DISTINCT [column names] FROM [table]
aggregate functions
- query a database directly for descriptive statistics
- aggregate functions
- SUM, AVG, MAX, MIN, COUNT
- SUM, AVG, MAX, MIN all take a single column name
- example: SELECT AVG(tmax) FROM weather;
- returns the average of all values in the tmax column
- COUNT
- get number of rows that meet query conditions
- SELECT COUNT(*) FROM [table]
- get number of unique values in a column
- SELECT COUNT(DISTINCT [column_names]) FROM [table_name]
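a minimal sketch of an aggregate query run through pandas, assuming the same weather table and data.db engine from above
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///data.db")

# Average max temperature, and number of distinct dates, in one query
query = """
SELECT AVG(tmax) AS avg_tmax,
       COUNT(DISTINCT date) AS num_days
FROM weather;
"""
summary = pd.read_sql(query, engine)
print(summary)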
group by
- aggregate functions calculate a single summary statistic by default
- you can think of GROUP BY as bucketing: records with the same value in the specified column(s) are grouped together
- e.g. GROUP BY fruit_name
- this groups together all records that share the same fruit_name value, so aggregates are computed per group
example code
# Create database engine
engine = create_engine("sqlite:///data.db")
# Write query to get plumbing call counts by borough
query = """SELECT borough, COUNT(*)
FROM hpd311calls
WHERE complaint_type = 'PLUMBING'
GROUP BY borough;"""
# Query database and create data frame
plumbing_call_counts = pd.read_sql(query, engine)
the query returns one row per borough value, along with the count of plumbing complaints for each
// LESSON 3.4 - LOADING MULTIPLE TABLES WITH JOINS
- relational databases can relate tables to each other using keys
joining and filtering
/* This is an inner join: only hpd311calls records whose created_date matches a date in weather appear in the result */
/* Get only heat/hot water calls and join in weather data */
SELECT *
FROM hpd311calls
JOIN weather
ON hpd311calls.created_date = weather.date
WHERE hpd311calls.complaint_type = 'HEAT/HOT WATER';
selecting everything from one table plus a specific column from another
# Query to get hpd311calls and precipitation values
query = """
SELECT hpd311calls.*, weather.prcp
FROM hpd311calls
JOIN weather
ON hpd311calls.created_date = weather.date;"""
# Load query results into the leak_calls data frame
leak_calls = pd.read_sql(query, engine)
# View the data frame
print(leak_calls.head())
REVIEW
- order of keywords
- SELECT
- FROM
- JOIN
- WHERE
- GROUP BY
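a minimal sketch combining the keywords in that order, assuming the same hpd311calls and weather tables and the data.db engine
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///data.db")

# Count heat/hot water calls per borough, joined to weather by date
query = """
SELECT hpd311calls.borough, COUNT(*) AS call_count
FROM hpd311calls
JOIN weather
  ON hpd311calls.created_date = weather.date
WHERE hpd311calls.complaint_type = 'HEAT/HOT WATER'
GROUP BY hpd311calls.borough;
"""
calls_by_borough = pd.read_sql(query, engine)
print(calls_by_borough)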
// LESSON 4.1 - INTRODUCTION TO JSON
- a common web data format, similar in structure to a JavaScript object or a Python dict
- not tabular, so records don't need to include null for missing attributes; they can simply omit them
- since JSON isn't tabular, pandas has to guess how to arrange it in a table
json arrangement
record orientation
- a list of dicts, one per record; attribute names are repeated in every record, so it is not space-efficient
example
[{
  "age": "19",
  "name": "Renz"
 },
 {
  "age": "20",
  "name": "Reny"
 }]
column oriented
- a dict of columns, each mapping row index to value; more space-efficient than record-oriented JSON, but harder to read
{
  "age": {
    "0": "19",
    "1": "20"
  },
  "name": {
    "0": "Renz",
    "1": "Reny"
  }
}
split oriented data
- separate "columns", "index", and "data" lists, so column names and the index are stored only once
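a sketch of the same two records in split orientation (constructed here as an illustration, not taken from the course file)
{
  "columns": ["age", "name"],
  "index": [0, 1],
  "data": [["19", "Renz"], ["20", "Reny"]]
}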
specifying orientation of json
import pandas as pd
death_causes = pd.read_json("nyc_death_json", orient="split")
// LESSON 4.2 - INTRO TO API
- defines how an application communicates with another program's server
- a way to get data from an application without knowing its database details
- the most common source of JSON data
- Python typically uses the requests library; JavaScript has a built-in fetch
requests.get()
- to get data from a url
- keyword arguments
- params: takes a dictionary of parameters and values to customize the API request
- headers: takes a dictionary; can be used to provide user authentication to the API
- response.json() to convert a json into a dict
making requests
import requests
import pandas as pd
api_url = "https://api.yelp.com/v3/businesses/search"
# Set up parameter dictionary according to documentation
# Params is the parameter to be used in fetching data
params = {"term": "bookstore", "location": "San Francisco"}
# Set up header dictionary w/ API key according to documentation
# API keys are employed to track and moderate API usage, make sure it is hidden
headers = {"Authorization": "Bearer {}".format(api_key)}
# Call the API
response = requests.get(api_url,
params=params,
headers=headers)
# Extract JSON data from the response
data = response.json()
# Load data to a data frame
cafes = pd.DataFrame(data["businesses"])
parsing responses
# Isolate the JSON data from the response object
data = response.json()
print(data)
# Load businesses data to a data frame
bookstores = pd.DataFrame(data["businesses"])
print(bookstores.head(2))
# Output (truncated): 2 rows x 16 columns, e.g.
#   alias: city-lights-bookstore-san-francisco      url: https://www.yelp.com/biz/city-lights-bookstore...
#   alias: alexander-book-company-san-francisco     url: https://www.yelp.com/biz/alexander-book-compan...
# print(data) shows the raw response dict, e.g. {'businesses': [{'id': '_rbF2ooLcMRA7Kh8neIr4g', 'alias': 'city-lights-bookstore-san-francisco', 'name': 'City L...
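one step the notes skip: checking whether the request succeeded before parsing the JSON; a minimal sketch using the standard requests API, assuming the same api_url, params, and headers defined above
import requests

response = requests.get(api_url, params=params, headers=headers)  # api_url/params/headers as above
print(response.status_code)   # 200 means the request succeeded
response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx responses
data = response.json()        # parse the JSON only after confirming success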
// LESSON 4.3 - WORKING WITH NESTED JSON
nested json
- json contain objects with attribute-value pairs
- JSON is nested when the value of an attribute is itself an object (or a list of objects)
Unflattened JSON:
{"user": {
    "Rachel": {
        "UserID": 1717171717,
        "Email": "[email protected]",
        "friends": ["John", "Jeremy", "Emily"]
    }
  }
}
Flattened JSON:
{
  "user_Rachel_friends_2": "Emily",
  "user_Rachel_friends_0": "John",
  "user_Rachel_UserID": 1717171717,
  "user_Rachel_Email": "[email protected]",
  "user_Rachel_friends_1": "Jeremy"
}
pandas.io.json
- submodule has tools for reading and writing JSON
- json_normalize()
- takes a dictionary/list of dictionaries (like pd.DataFrame() does)
- returns a flattened data frame
- default flattened column name pattern: attribute.nestedattribute
- choose a different separator with the sep argument
loading nested json data
import pandas as pd
import requests
from pandas.io.json import json_normalize
# Set up headers, parameters, and API endpoint
api_url = "https://api.yelp.com/v3/businesses/search"
headers = {"Authorization": "Bearer {}".format(api_key)}
params = {"term": "bookstore",
"location": "San Francisco"}
# Make the API call and extract the JSON data
response = requests.get(api_url,
headers=headers,
params=params)
data = response.json()
# Flatten data and load to data frame, with _ separators
bookstores = json_normalize(data["businesses"], sep="_")
deeply nested data
- json_normalize() arguments for drilling into nested records:
- record_path: string or list of strings giving the path to the nested list of records to flatten
- meta: list of other (higher-level) attributes to carry into the data frame alongside each flattened record
- meta_prefix: string to prefix to the meta column names
# Flatten categories data, bring in business details
df = json_normalize(data["businesses"],
sep="_",
record_path="categories",
meta=["name", "alias", "rating",
["coordinates", "latitude"],
["coordinates", "longitude"]],
meta_prefix="biz_")
# result: one row per business/category pair, with business details in biz_-prefixed columns, like a spreadsheet table
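note: since pandas 1.0, json_normalize is available directly as pd.json_normalize() and the pandas.io.json import path is deprecated; a minimal sketch of the same flattening call, assuming the data dict from above
import pandas as pd

# Same flattening as above, using the top-level pandas function
bookstores = pd.json_normalize(data["businesses"], sep="_")
print(bookstores.columns)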
// LESSON 4.4 - COMBINING MULTIPLE DATASETS
appending
- adding rows from one data frame to another
- append()
- syntax: df1.append(df2)
- set ignore_index to True to renumber rows
# Add an offset parameter to get cafes 51-100
params = {"term": "cafe",
"location": "NYC",
"sort_by": "rating",
"limit": 50,
"offset": 50}
result = requests.get(api_url, headers=headers, params=params)
next_50_cafes = json_normalize(result.json()["businesses"])
# Append the results, setting ignore_index to renumber rows
cafes = top_50_cafes.append(next_50_cafes, ignore_index=True)
merging
- combining datasets to add related columns
- datasets have key column(s) with common values
- merge(), pandas version of a sql join
- df.merge() arguments
- second data frame to merge
- columns to merge on
- on if names are the same in both data frames
- left_on and right_on if key names differ
- key columns should be the same data type
- default merge() behavior: return only values that are in both datasets (inner join)
- one record for each value match between data frames
# Merge crosswalk into cafes on their zip code fields
cafes_with_pumas = cafes.merge(crosswalk,
left_on="location_zip_code",
right_on="zipcode")
# Merge pop_data into cafes_with_pumas on puma field
cafes_with_pop = cafes_with_pumas.merge(pop_data, on="puma")
# View the data
print(cafes_with_pop.head())
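the default merge() is an inner join; other join types can be selected with the how argument. a minimal sketch, assuming the same cafes and crosswalk frames as above (how="left" keeps cafes with no matching zip code, filling the crosswalk columns with NaN)
# Left join: keep every cafe, even those whose zip code is missing from the crosswalk
cafes_with_pumas = cafes.merge(crosswalk,
                               left_on="location_zip_code",
                               right_on="zipcode",
                               how="left")
print(cafes_with_pumas.puma.isna().sum())  # cafes with no matching crosswalk record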
// LESSON X.1 - IMPORTANT PANDAS FUNCTIONS
loading a csv - pd.read_csv("file.csv")
loading a different delimiter - pd.read_csv("file.tsv", sep="\t")
show the first 5 records - pd.read_csv("file.csv").head(5)
show the number of rows and columns - pd.read_csv("file.csv").shape
choose columns to read (by position or by name) - pd.read_csv("file.csv", usecols=[0, 1, 2])
limit the number of rows to read - pd.read_csv("file.csv", nrows=1000)
print the data type of each column - print(pd.read_csv("file.csv").dtypes)
summary statistics - pd.read_json("file.json").describe()
// LESSON X.2 - NEXT THING TO STUDY
DATA VISUALIZATION WITH MATPLOTLIB OR SEABORN
DATA MANIPULATION WITH PANDAS
// LESSON X.3 - CONCLUSION - pandas
1 - loading flat files, modifying flat file imports, handling errors and missing data in a flat file
2 - loading spreadsheets, getting data from multiple worksheets, modifying true/false datatypes, modifying imports: parsing dates
3 - loading databases, sql queries, join
4 - loading json, intro to api, working with nested jsons, combining different data sources