Python Pandas DataFrame: Working with Data, Part 2
loc vs iloc comparison, NaN (missing values), unique, drop
BASIC_03.Pandas_DataFrame.ipynb¶
In [1]:
import pandas as pd
import numpy as np
In [2]:
# If index and columns are not set when creating a DataFrame, both default to integers starting at 0.
df = pd.DataFrame(np.random.randn(10,3)) # build the frame from a 10x3 NumPy array of random values
df.columns = ["one", "two", "three"]
df
Out[2]:
| | one | two | three |
---|---|---|---|
0 | -1.141389 | -1.597709 | -0.593118 |
1 | -0.896467 | 0.323917 | -0.015165 |
2 | 1.093343 | -2.020293 | 0.099026 |
3 | 1.971081 | -0.379402 | 0.751828 |
4 | -0.569713 | -0.061500 | -0.489205 |
5 | -0.502000 | 0.545283 | 0.037430 |
6 | 1.407763 | 2.647069 | 0.217019 |
7 | -1.627089 | -0.195426 | 0.157814 |
8 | -1.544826 | -1.033046 | -0.704031 |
9 | -0.919720 | -0.968175 | -1.126654 |
In [3]:
# A quick way to list the column names when there are very many of them
df_col_check = pd.DataFrame(df.columns)
df_col_check
Out[3]:
| | 0 |
---|---|
0 | one |
1 | two |
2 | three |
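If a second DataFrame feels heavy just for browsing names, a plain list works too. This is a minimal sketch on a hypothetical empty frame named `demo` (not the `df` above), using `Index.tolist()` and `enumerate` to pair each name with its position.

```python
import pandas as pd

# Hypothetical frame with the same column names as the example above.
demo = pd.DataFrame(columns=["one", "two", "three"])

# Index.tolist() turns the column Index into a plain Python list.
names = demo.columns.tolist()

# Print each column's integer position next to its name.
for position, name in enumerate(names):
    print(position, name)
```

The positions printed here are the same integers that `iloc` expects for its column argument.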
loc vs iloc¶
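iloc[ row selection, column selection ] vs. loc[ row condition, column selection ]
iloc[row_num, col_num] selects by integer position: row_num and col_num take integers (or lists of integers, integer slices, or boolean arrays), never labels. loc[row_select, col_select] selects by label: row_select and col_select take index and column labels, label slices, or a boolean condition.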
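The difference is easiest to see when the index labels are not 0, 1, 2. The sketch below assumes a small hypothetical frame `demo` whose index is 10, 20, 30, so labels and positions no longer coincide.

```python
import pandas as pd

# Hypothetical frame: index labels (10, 20, 30) differ from positions (0, 1, 2).
demo = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=[10, 20, 30],
)

by_label = demo.loc[10, "one"]   # row label 10, column label "one"
by_position = demo.iloc[0, 0]    # first row, first column

# loc also accepts a boolean condition for the rows.
rows_by_condition = demo.loc[demo["two"] > 4.5, ["one"]]

# With loc, 0 is treated as a label, and no row is labelled 0 here.
try:
    demo.loc[0]
except KeyError:
    print("label 0 does not exist")

print(by_label, by_position)
print(rows_by_condition)
```

Both lookups return the same cell, but only because label 10 happens to sit at position 0.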
In [4]:
# Pressing Shift + Tab + Tab shows the API information, but
# it sometimes fails to appear, so print the docstring when needed.
print(df.loc.__doc__)
Access a group of rows and columns by label(s) or a boolean array.

``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.

  .. warning:: Note that contrary to usual python slices, **both** the
      start and the stop are included

- A boolean array of the same length as the axis being sliced,
  e.g. ``[True, False, True]``.
- An alignable boolean Series. The index of the key will be aligned before
  masking.
- An alignable Index. The Index of the returned selection will be the input.
- A ``callable`` function with one argument (the calling Series or
  DataFrame) and that returns valid output for indexing (one of the above)

See more at :ref:`Selection by Label <indexing.label>`.

Raises
------
KeyError
    If any items are not found.
IndexingError
    If an indexed key is passed and its index is unalignable to the frame index.

See Also
--------
DataFrame.at : Access a single value for a row/column label pair.
DataFrame.iloc : Access group of rows and columns by integer position(s).
DataFrame.xs : Returns a cross-section (row(s) or column(s)) from the
    Series/DataFrame.
Series.loc : Access group of values using labels.

Examples
--------
**Getting values**

>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...      index=['cobra', 'viper', 'sidewinder'],
...      columns=['max_speed', 'shield'])
>>> df
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8

Single label. Note this returns the row as a Series.

>>> df.loc['viper']
max_speed    4
shield       5
Name: viper, dtype: int64

List of labels. Note using ``[[]]`` returns a DataFrame.

>>> df.loc[['viper', 'sidewinder']]
            max_speed  shield
viper               4       5
sidewinder          7       8

Single label for row and column

>>> df.loc['cobra', 'shield']
2

Slice with labels for row and single label for column. As mentioned
above, note that both the start and stop of the slice are included.

>>> df.loc['cobra':'viper', 'max_speed']
cobra    1
viper    4
Name: max_speed, dtype: int64

Boolean list with the same length as the row axis

>>> df.loc[[False, False, True]]
            max_speed  shield
sidewinder          7       8

Alignable boolean Series:

>>> df.loc[pd.Series([False, True, False],
...        index=['viper', 'sidewinder', 'cobra'])]
            max_speed  shield
sidewinder          7       8

Index (same behavior as ``df.reindex``)

>>> df.loc[pd.Index(["cobra", "viper"], name="foo")]
       max_speed  shield
foo
cobra          1       2
viper          4       5

Conditional that returns a boolean Series

>>> df.loc[df['shield'] > 6]
            max_speed  shield
sidewinder          7       8

Conditional that returns a boolean Series with column labels specified

>>> df.loc[df['shield'] > 6, ['max_speed']]
            max_speed
sidewinder          7

Callable that returns a boolean Series

>>> df.loc[lambda df: df['shield'] == 8]
            max_speed  shield
sidewinder          7       8

**Setting values**

Set value for all items matching the list of labels

>>> df.loc[['viper', 'sidewinder'], ['shield']] = 50
>>> df
            max_speed  shield
cobra               1       2
viper               4      50
sidewinder          7      50

Set value for an entire row

>>> df.loc['cobra'] = 10
>>> df
            max_speed  shield
cobra              10      10
viper               4      50
sidewinder          7      50

Set value for an entire column

>>> df.loc[:, 'max_speed'] = 30
>>> df
            max_speed  shield
cobra              30      10
viper              30      50
sidewinder         30      50

Set value for rows matching callable condition

>>> df.loc[df['shield'] > 35] = 0
>>> df
            max_speed  shield
cobra              30      10
viper               0       0
sidewinder          0       0

**Getting values on a DataFrame with an index that has integer labels**

Another example using integers for the index

>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...      index=[7, 8, 9], columns=['max_speed', 'shield'])
>>> df
   max_speed  shield
7          1       2
8          4       5
9          7       8

Slice with integer labels for rows. As mentioned above, note that both
the start and stop of the slice are included.

>>> df.loc[7:9]
   max_speed  shield
7          1       2
8          4       5
9          7       8

**Getting values with a MultiIndex**

A number of examples using a DataFrame with a MultiIndex

>>> tuples = [
...    ('cobra', 'mark i'), ('cobra', 'mark ii'),
...    ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'),
...    ('viper', 'mark ii'), ('viper', 'mark iii')
... ]
>>> index = pd.MultiIndex.from_tuples(tuples)
>>> values = [[12, 2], [0, 4], [10, 20],
...         [1, 4], [7, 1], [16, 36]]
>>> df = pd.DataFrame(values, columns=['max_speed', 'shield'], index=index)
>>> df
                     max_speed  shield
cobra      mark i           12       2
           mark ii           0       4
sidewinder mark i           10      20
           mark ii           1       4
viper      mark ii           7       1
           mark iii         16      36

Single label. Note this returns a DataFrame with a single index.

>>> df.loc['cobra']
         max_speed  shield
mark i          12       2
mark ii          0       4

Single index tuple. Note this returns a Series.

>>> df.loc[('cobra', 'mark ii')]
max_speed    0
shield       4
Name: (cobra, mark ii), dtype: int64

Single label for row and column. Similar to passing in a tuple, this
returns a Series.

>>> df.loc['cobra', 'mark i']
max_speed    12
shield        2
Name: (cobra, mark i), dtype: int64

Single tuple. Note using ``[[]]`` returns a DataFrame.

>>> df.loc[[('cobra', 'mark ii')]]
               max_speed  shield
cobra mark ii          0       4

Single tuple for the index with a single label for the column

>>> df.loc[('cobra', 'mark i'), 'shield']
2

Slice from index tuple to single label

>>> df.loc[('cobra', 'mark i'):'viper']
                     max_speed  shield
cobra      mark i           12       2
           mark ii           0       4
sidewinder mark i           10      20
           mark ii           1       4
viper      mark ii           7       1
           mark iii         16      36

Slice from index tuple to index tuple

>>> df.loc[('cobra', 'mark i'):('viper', 'mark ii')]
                    max_speed  shield
cobra      mark i          12       2
           mark ii          0       4
sidewinder mark i          10      20
           mark ii          1       4
viper      mark ii          7       1

Please see the :ref:`user guide<advanced.advanced_hierarchical>`
for more details and explanations of advanced indexing.
In [5]:
df.loc[1]
Out[5]:
one     -0.896467
two      0.323917
three   -0.015165
Name: 1, dtype: float64
In [6]:
#df.loc[ df.index == 9 , :] # every column of the row whose index label is 9; a bare ':' selects the whole axis
df.loc[ 0:1, :]
Out[6]:
| | one | two | three |
---|---|---|---|
0 | -1.141389 | -1.597709 | -0.593118 |
1 | -0.896467 | 0.323917 | -0.015165 |
In [7]:
df.loc[0:3, ['one', 'two']]
Out[7]:
| | one | two |
---|---|---|
0 | -1.141389 | -1.597709 |
1 | -0.896467 | 0.323917 |
2 | 1.093343 | -2.020293 |
3 | 1.971081 | -0.379402 |
─────────────────────────────────────¶
In [8]:
print(df.iloc.__doc__)
Purely integer-location based indexing for selection by position.

``.iloc[]`` is primarily integer position based (from ``0`` to
``length-1`` of the axis), but may also be used with a boolean
array.

Allowed inputs are:

- An integer, e.g. ``5``.
- A list or array of integers, e.g. ``[4, 3, 0]``.
- A slice object with ints, e.g. ``1:7``.
- A boolean array.
- A ``callable`` function with one argument (the calling Series or
  DataFrame) and that returns valid output for indexing (one of the above).
  This is useful in method chains, when you don't have a reference to the
  calling object, but would like to base your selection on some value.
- A tuple of row and column indexes. The tuple elements consist of one of the
  above inputs, e.g. ``(0, 1)``.

``.iloc`` will raise ``IndexError`` if a requested indexer is
out-of-bounds, except *slice* indexers which allow out-of-bounds
indexing (this conforms with python/numpy *slice* semantics).

See more at :ref:`Selection by Position <indexing.integer>`.

See Also
--------
DataFrame.iat : Fast integer location scalar accessor.
DataFrame.loc : Purely label-location based indexer for selection by label.
Series.iloc : Purely integer-location based indexing for
               selection by position.

Examples
--------
>>> mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
...           {'a': 100, 'b': 200, 'c': 300, 'd': 400},
...           {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
>>> df = pd.DataFrame(mydict)
>>> df
      a     b     c     d
0     1     2     3     4
1   100   200   300   400
2  1000  2000  3000  4000

**Indexing just the rows**

With a scalar integer.

>>> type(df.iloc[0])
<class 'pandas.core.series.Series'>
>>> df.iloc[0]
a    1
b    2
c    3
d    4
Name: 0, dtype: int64

With a list of integers.

>>> df.iloc[[0]]
   a  b  c  d
0  1  2  3  4
>>> type(df.iloc[[0]])
<class 'pandas.core.frame.DataFrame'>

>>> df.iloc[[0, 1]]
     a    b    c    d
0    1    2    3    4
1  100  200  300  400

With a `slice` object.

>>> df.iloc[:3]
      a     b     c     d
0     1     2     3     4
1   100   200   300   400
2  1000  2000  3000  4000

With a boolean mask the same length as the index.

>>> df.iloc[[True, False, True]]
      a     b     c     d
0     1     2     3     4
2  1000  2000  3000  4000

With a callable, useful in method chains. The `x` passed
to the ``lambda`` is the DataFrame being sliced. This selects
the rows whose index label even.

>>> df.iloc[lambda x: x.index % 2 == 0]
      a     b     c     d
0     1     2     3     4
2  1000  2000  3000  4000

**Indexing both axes**

You can mix the indexer types for the index and columns. Use ``:`` to
select the entire axis.

With scalar integers.

>>> df.iloc[0, 1]
2

With lists of integers.

>>> df.iloc[[0, 2], [1, 3]]
      b     d
0     2     4
2  2000  4000

With `slice` objects.

>>> df.iloc[1:3, 0:3]
      a     b     c
1   100   200   300
2  1000  2000  3000

With a boolean array whose length matches the columns.

>>> df.iloc[:, [True, False, True, False]]
      a     c
0     1     3
1   100   300
2  1000  3000

With a callable function that expects the Series or DataFrame.

>>> df.iloc[:, lambda df: [0, 2]]
      a     c
0     1     3
1   100   300
2  1000  3000
In [9]:
df.iloc[2]
Out[9]:
one      1.093343
two     -2.020293
three    0.099026
Name: 2, dtype: float64
In [10]:
df.iloc[1, 2] # value at row position 1, column position 2
Out[10]:
-0.015164940108393593
In [11]:
#df.iloc[:, 0:3]
df.iloc[:3, :2] # rows 0-2 and columns 0-1 (the end position is excluded); same as df.iloc[0:3, 0:2]
Out[11]:
| | one | two |
---|---|---|
0 | -1.141389 | -1.597709 |
1 | -0.896467 | 0.323917 |
2 | 1.093343 | -2.020293 |
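One detail worth checking side by side: a `loc` slice includes its end label, while an `iloc` slice excludes its end position, like ordinary Python slicing. A minimal sketch on a hypothetical 4x3 frame `demo` with the default integer index:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the default index 0..3, so labels equal positions.
demo = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["one", "two", "three"])

# loc: labels 0, 1, 2 and columns "one" through "two" (both ends included).
label_slice = demo.loc[0:2, "one":"two"]

# iloc: positions 0 and 1 only on each axis (end excluded).
position_slice = demo.iloc[0:2, 0:2]

print(label_slice.shape)     # (3, 2)
print(position_slice.shape)  # (2, 2)
```

The same `0:2` therefore returns three rows with `loc` and two rows with `iloc`.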
In [12]:
# Build a boolean mask from a condition on a column (the mask is then used to extract rows)
df["two"].values > 0.3
#df[df["two"].values > 0.3]
#df["two"].isin([-0.643932, 0.417865])
Out[12]:
array([False, True, False, False, False, True, True, False, False, False])
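The boolean array above is only the mask; passing it back into the frame is what actually extracts rows. The sketch below assumes a small hypothetical frame `demo` and shows three common forms: plain `[]` filtering, `loc` with a column list, and `isin` for membership tests.

```python
import pandas as pd

# Hypothetical data; the values are made up for illustration.
demo = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0], "two": [-0.5, 0.4, 0.1, 2.6]})

mask = demo["two"] > 0.3                       # boolean Series aligned on the index
filtered = demo[mask]                          # rows where the mask is True
filtered_cols = demo.loc[mask, ["one"]]        # same rows, only column "one"
members = demo[demo["one"].isin([1.0, 4.0])]   # rows whose "one" value is in the list

print(filtered)
print(filtered_cols)
print(members)
```

Using the Series itself (without `.values`) keeps the index attached, which lets pandas align the mask with the frame.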
Handling NaN (missing values)¶
np.nan represents a missing value. It plays the same role as NULL.
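One practical consequence: NaN never compares equal to anything, including itself, so `==` cannot be used to find missing values. A minimal sketch with a hypothetical Series `s`:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)   # False: NaN is not equal even to itself
print(pd.isna(np.nan))    # True: use isna()/isnull() to test for it

# Hypothetical Series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

wrong_mask = s == np.nan  # every element is False
right_mask = s.isna()     # True only where the value is missing

print(wrong_mask.tolist())
print(right_mask.tolist())
```

This is why the cells below use `isnull()` rather than a comparison.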
In [13]:
df["four"] = [np.nan, np.nan, np.nan, 4, 5, 6, 7, 8, 9, 10]
df
Out[13]:
| | one | two | three | four |
---|---|---|---|---|
0 | -1.141389 | -1.597709 | -0.593118 | NaN |
1 | -0.896467 | 0.323917 | -0.015165 | NaN |
2 | 1.093343 | -2.020293 | 0.099026 | NaN |
3 | 1.971081 | -0.379402 | 0.751828 | 4.0 |
4 | -0.569713 | -0.061500 | -0.489205 | 5.0 |
5 | -0.502000 | 0.545283 | 0.037430 | 6.0 |
6 | 1.407763 | 2.647069 | 0.217019 | 7.0 |
7 | -1.627089 | -0.195426 | 0.157814 | 8.0 |
8 | -1.544826 | -1.033046 | -0.704031 | 9.0 |
9 | -0.919720 | -0.968175 | -1.126654 | 10.0 |
In [14]:
df.dropna() # drop every row that contains at least one missing value
Out[14]:
| | one | two | three | four |
---|---|---|---|---|
3 | 1.971081 | -0.379402 | 0.751828 | 4.0 |
4 | -0.569713 | -0.061500 | -0.489205 | 5.0 |
5 | -0.502000 | 0.545283 | 0.037430 | 6.0 |
6 | 1.407763 | 2.647069 | 0.217019 | 7.0 |
7 | -1.627089 | -0.195426 | 0.157814 | 8.0 |
8 | -1.544826 | -1.033046 | -0.704031 | 9.0 |
9 | -0.919720 | -0.968175 | -1.126654 | 10.0 |
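`dropna()` with no arguments can be more aggressive than intended. The sketch below, on a hypothetical frame `demo`, tries a few of its parameters (`subset`, `axis`, `thresh`) to narrow what gets dropped. Each call returns a new frame and leaves `demo` unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values scattered over two columns.
demo = pd.DataFrame({
    "one": [1.0, np.nan, 3.0, np.nan],
    "two": [np.nan, np.nan, 6.0, np.nan],
    "three": [7.0, 8.0, 9.0, 10.0],
})

complete_rows = demo.dropna()               # keep rows with no NaN at all
has_one = demo.dropna(subset=["one"])       # only look at column "one"
complete_cols = demo.dropna(axis=1)         # drop columns that contain NaN
at_least_two = demo.dropna(thresh=2)        # keep rows with >= 2 non-missing values

print(complete_rows.index.tolist())
print(has_one.index.tolist())
print(complete_cols.columns.tolist())
print(at_least_two.index.tolist())
```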
In [15]:
df.fillna(value=-1) # replace every NaN with the given value
Out[15]:
| | one | two | three | four |
---|---|---|---|---|
0 | -1.141389 | -1.597709 | -0.593118 | -1.0 |
1 | -0.896467 | 0.323917 | -0.015165 | -1.0 |
2 | 1.093343 | -2.020293 | 0.099026 | -1.0 |
3 | 1.971081 | -0.379402 | 0.751828 | 4.0 |
4 | -0.569713 | -0.061500 | -0.489205 | 5.0 |
5 | -0.502000 | 0.545283 | 0.037430 | 6.0 |
6 | 1.407763 | 2.647069 | 0.217019 | 7.0 |
7 | -1.627089 | -0.195426 | 0.157814 | 8.0 |
8 | -1.544826 | -1.033046 | -0.704031 | 9.0 |
9 | -0.919720 | -0.968175 | -1.126654 | 10.0 |
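A single fill value is not always appropriate for every column. As a sketch, `fillna` also accepts a dict of per-column values or a Series such as the column means; the frame `demo` and the fill values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
demo = pd.DataFrame({"three": [0.5, np.nan, 1.5], "four": [np.nan, 4.0, 6.0]})

# A dict maps each column name to its own fill value.
per_column = demo.fillna({"three": 0.0, "four": -1.0})

# demo.mean() is a Series indexed by column name, so each column
# is filled with its own mean (NaN is skipped when computing it).
with_mean = demo.fillna(demo.mean())

print(per_column)
print(with_mean)
```

As with `dropna`, the result is a new frame; `demo` keeps its NaN values unless the result is assigned back.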
In [16]:
df.isnull()
Out[16]:
| | one | two | three | four |
---|---|---|---|---|
0 | False | False | False | True |
1 | False | False | False | True |
2 | False | False | False | True |
3 | False | False | False | False |
4 | False | False | False | False |
5 | False | False | False | False |
6 | False | False | False | False |
7 | False | False | False | False |
8 | False | False | False | False |
9 | False | False | False | False |
In [17]:
df.isnull()["four"]
Out[17]:
0     True
1     True
2     True
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: four, dtype: bool
In [18]:
df.loc[df.isnull()["four"],:]
Out[18]:
| | one | two | three | four |
---|---|---|---|---|
0 | -1.141389 | -1.597709 | -0.593118 | NaN |
1 | -0.896467 | 0.323917 | -0.015165 | NaN |
2 | 1.093343 | -2.020293 | 0.099026 | NaN |
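Two small variations on the mask above are often handy: summing the mask to count missing values per column, and `notna()` to select the rows that do have a value. A minimal sketch on a hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame where only column "four" has missing values.
demo = pd.DataFrame({"one": [1.0, 2.0, 3.0], "four": [np.nan, np.nan, 4.0]})

# True counts as 1 when summed, so this gives the NaN count per column.
missing_per_column = demo.isna().sum()

# notna() is the inverse mask: rows where "four" is present.
rows_with_value = demo.loc[demo["four"].notna(), :]

print(missing_per_column)
print(rows_with_value)
```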
unique¶
In [19]:
df["four"].unique() # array of the distinct values in a column, duplicates removed
Out[19]:
array([nan, 4., 5., 6., 7., 8., 9., 10.])
In [30]:
np.unique(df["four"])
Out[30]:
array([ 4., 5., 6., 7., 8., 9., 10., nan])
In [31]:
len(np.unique(df["four"]))
Out[31]:
8
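Note that the count of 8 above includes NaN as one of the values. If only the number of distinct values is needed, `Series.nunique()` is a possible shortcut; it skips NaN by default and takes `dropna=False` to include it. A sketch with a hypothetical Series `four`:

```python
import numpy as np
import pandas as pd

# Hypothetical column with repeated values and two NaNs.
four = pd.Series([np.nan, np.nan, 4.0, 5.0, 5.0])

distinct = four.unique()                      # distinct values in order of appearance, NaN kept once
n_without_nan = four.nunique()                # NaN is not counted by default
n_with_nan = four.nunique(dropna=False)       # count NaN as a value too
counts = four.value_counts(dropna=False)      # how often each value occurs

print(distinct)
print(n_without_nan, n_with_nan)
print(counts)
```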
drop¶
In [20]:
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
{'a': 100, 'b': 200, 'c': 300, 'd': 400},
{'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
df_dict = pd.DataFrame(mydict)
df_dict
Out[20]:
| | a | b | c | d |
---|---|---|---|---|
0 | 1 | 2 | 3 | 4 |
1 | 100 | 200 | 300 | 400 |
2 | 1000 | 2000 | 3000 | 4000 |
In [21]:
df_dict = df_dict.drop(0, axis=0)
df_dict
Out[21]:
| | a | b | c | d |
---|---|---|---|---|
1 | 100 | 200 | 300 | 400 |
2 | 1000 | 2000 | 3000 | 4000 |
In [22]:
df_dict = df_dict.drop("b", axis=1)
df_dict
Out[22]:
| | a | c | d |
---|---|---|---|
1 | 100 | 300 | 400 |
2 | 1000 | 3000 | 4000 |
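Instead of remembering which `axis` number means rows or columns, `drop` also accepts the `index=` and `columns=` keywords, and both take a list to remove several labels at once. A minimal sketch on a hypothetical frame `demo`; the column name "zzz" is deliberately one that does not exist.

```python
import pandas as pd

# Hypothetical frame modelled on the example above.
demo = pd.DataFrame({"a": [1, 100, 1000], "b": [2, 200, 2000], "c": [3, 300, 3000]})

no_rows = demo.drop(index=[0, 2])          # same as drop([0, 2], axis=0)
no_cols = demo.drop(columns=["b", "c"])    # same as drop(["b", "c"], axis=1)

# By default a missing label raises KeyError; errors="ignore" skips it instead.
safe = demo.drop(columns=["zzz"], errors="ignore")

print(no_rows)
print(no_cols)
print(safe.shape)
```

`drop` returns a new frame each time, which is why the cells above assign the result back to `df_dict`.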