Python Pandas DataFrame: Working with Data, Part 2

Comparing loc vs iloc, NaN (missing values), unique, drop
BASIC_03.Pandas_DataFrame.ipynb
In [1]:
import pandas as pd
import numpy as np
In [2]:
# If index and columns are not given when creating a DataFrame, they default to integers (0, 1, 2, ...).
df = pd.DataFrame(np.random.randn(10,3)) # pass a 10x3 NumPy array of random values
df.columns = ["one", "two", "three"]
df
Out[2]:
| | one | two | three |
|---|---|---|---|
| 0 | -1.141389 | -1.597709 | -0.593118 |
| 1 | -0.896467 | 0.323917 | -0.015165 |
| 2 | 1.093343 | -2.020293 | 0.099026 |
| 3 | 1.971081 | -0.379402 | 0.751828 |
| 4 | -0.569713 | -0.061500 | -0.489205 |
| 5 | -0.502000 | 0.545283 | 0.037430 |
| 6 | 1.407763 | 2.647069 | 0.217019 |
| 7 | -1.627089 | -0.195426 | 0.157814 |
| 8 | -1.544826 | -1.033046 | -0.704031 |
| 9 | -0.919720 | -0.968175 | -1.126654 |
In [3]:
# An easy way to list the column names when there are many columns
df_col_check = pd.DataFrame(df.columns)
df_col_check
Out[3]:
| | 0 |
|---|---|
| 0 | one |
| 1 | two |
| 2 | three |
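As a lighter alternative to wrapping the columns in a second DataFrame, the column names can also be pulled out as a plain Python list. A minimal sketch, assuming a hypothetical empty frame `demo` with the same three column names:

```python
import pandas as pd

# Hypothetical frame that only carries the column names used in this post.
demo = pd.DataFrame(columns=["one", "two", "three"])

names = demo.columns.tolist()          # plain Python list of column labels
for position, name in enumerate(names):
    print(position, name)              # integer position next to each label
```

The printed positions are the same integers that `iloc` expects for column selection.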
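loc vs iloc
iloc[row selection, column selection] selects by integer position. loc[row selection, column selection] selects by label or by condition.
iloc[row_num, col_num]: row_num and col_num are integer positions (a single int, a list of ints, or an int slice; a boolean array is also accepted).
loc[row_select, col_select]: row_select and col_select are index and column labels, and may also be boolean conditions such as df["two"] > 0.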
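The difference is easiest to see when the index labels do not coincide with the row positions. A minimal sketch, assuming a small hypothetical DataFrame `demo` whose labels are 10, 20, 30 (it is not part of the notebook):

```python
import pandas as pd

# Hypothetical frame: index labels (10, 20, 30) differ from positions (0, 1, 2).
demo = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [-1.0, 0.5, 2.0]},
                    index=[10, 20, 30])

by_label = demo.loc[20, "two"]                  # row label 20, column label "two"
by_position = demo.iloc[1, 1]                   # second row, second column
filtered = demo.loc[demo["two"] > 0, ["one"]]   # boolean condition plus column labels

print(by_label, by_position)                    # both refer to the same cell
print(filtered)
```

Here `demo.loc[1, "two"]` would raise a `KeyError`, because no row carries the label 1.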
In [4]:
# Pressing Shift+Tab twice in Jupyter shows the API documentation,
# but the tooltip sometimes fails to appear, so print the docstring when needed.
print(df.loc.__doc__)
Access a group of rows and columns by label(s) or a boolean array.
``.loc[]`` is primarily label based, but may also be used with a
boolean array.
Allowed inputs are:
- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
interpreted as a *label* of the index, and **never** as an
integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.
.. warning:: Note that contrary to usual python slices, **both** the
start and the stop are included
- A boolean array of the same length as the axis being sliced,
e.g. ``[True, False, True]``.
- An alignable boolean Series. The index of the key will be aligned before
masking.
- An alignable Index. The Index of the returned selection will be the input.
- A ``callable`` function with one argument (the calling Series or
DataFrame) and that returns valid output for indexing (one of the above)
See more at :ref:`Selection by Label <indexing.label>`.
Raises
------
KeyError
If any items are not found.
IndexingError
If an indexed key is passed and its index is unalignable to the frame index.
See Also
--------
DataFrame.at : Access a single value for a row/column label pair.
DataFrame.iloc : Access group of rows and columns by integer position(s).
DataFrame.xs : Returns a cross-section (row(s) or column(s)) from the
Series/DataFrame.
Series.loc : Access group of values using labels.
Examples
--------
**Getting values**
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
... index=['cobra', 'viper', 'sidewinder'],
... columns=['max_speed', 'shield'])
>>> df
max_speed shield
cobra 1 2
viper 4 5
sidewinder 7 8
Single label. Note this returns the row as a Series.
>>> df.loc['viper']
max_speed 4
shield 5
Name: viper, dtype: int64
List of labels. Note using ``[[]]`` returns a DataFrame.
>>> df.loc[['viper', 'sidewinder']]
max_speed shield
viper 4 5
sidewinder 7 8
Single label for row and column
>>> df.loc['cobra', 'shield']
2
Slice with labels for row and single label for column. As mentioned
above, note that both the start and stop of the slice are included.
>>> df.loc['cobra':'viper', 'max_speed']
cobra 1
viper 4
Name: max_speed, dtype: int64
Boolean list with the same length as the row axis
>>> df.loc[[False, False, True]]
max_speed shield
sidewinder 7 8
Alignable boolean Series:
>>> df.loc[pd.Series([False, True, False],
... index=['viper', 'sidewinder', 'cobra'])]
max_speed shield
sidewinder 7 8
Index (same behavior as ``df.reindex``)
>>> df.loc[pd.Index(["cobra", "viper"], name="foo")]
max_speed shield
foo
cobra 1 2
viper 4 5
Conditional that returns a boolean Series
>>> df.loc[df['shield'] > 6]
max_speed shield
sidewinder 7 8
Conditional that returns a boolean Series with column labels specified
>>> df.loc[df['shield'] > 6, ['max_speed']]
max_speed
sidewinder 7
Callable that returns a boolean Series
>>> df.loc[lambda df: df['shield'] == 8]
max_speed shield
sidewinder 7 8
**Setting values**
Set value for all items matching the list of labels
>>> df.loc[['viper', 'sidewinder'], ['shield']] = 50
>>> df
max_speed shield
cobra 1 2
viper 4 50
sidewinder 7 50
Set value for an entire row
>>> df.loc['cobra'] = 10
>>> df
max_speed shield
cobra 10 10
viper 4 50
sidewinder 7 50
Set value for an entire column
>>> df.loc[:, 'max_speed'] = 30
>>> df
max_speed shield
cobra 30 10
viper 30 50
sidewinder 30 50
Set value for rows matching callable condition
>>> df.loc[df['shield'] > 35] = 0
>>> df
max_speed shield
cobra 30 10
viper 0 0
sidewinder 0 0
**Getting values on a DataFrame with an index that has integer labels**
Another example using integers for the index
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
... index=[7, 8, 9], columns=['max_speed', 'shield'])
>>> df
max_speed shield
7 1 2
8 4 5
9 7 8
Slice with integer labels for rows. As mentioned above, note that both
the start and stop of the slice are included.
>>> df.loc[7:9]
max_speed shield
7 1 2
8 4 5
9 7 8
**Getting values with a MultiIndex**
A number of examples using a DataFrame with a MultiIndex
>>> tuples = [
... ('cobra', 'mark i'), ('cobra', 'mark ii'),
... ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'),
... ('viper', 'mark ii'), ('viper', 'mark iii')
... ]
>>> index = pd.MultiIndex.from_tuples(tuples)
>>> values = [[12, 2], [0, 4], [10, 20],
... [1, 4], [7, 1], [16, 36]]
>>> df = pd.DataFrame(values, columns=['max_speed', 'shield'], index=index)
>>> df
max_speed shield
cobra mark i 12 2
mark ii 0 4
sidewinder mark i 10 20
mark ii 1 4
viper mark ii 7 1
mark iii 16 36
Single label. Note this returns a DataFrame with a single index.
>>> df.loc['cobra']
max_speed shield
mark i 12 2
mark ii 0 4
Single index tuple. Note this returns a Series.
>>> df.loc[('cobra', 'mark ii')]
max_speed 0
shield 4
Name: (cobra, mark ii), dtype: int64
Single label for row and column. Similar to passing in a tuple, this
returns a Series.
>>> df.loc['cobra', 'mark i']
max_speed 12
shield 2
Name: (cobra, mark i), dtype: int64
Single tuple. Note using ``[[]]`` returns a DataFrame.
>>> df.loc[[('cobra', 'mark ii')]]
max_speed shield
cobra mark ii 0 4
Single tuple for the index with a single label for the column
>>> df.loc[('cobra', 'mark i'), 'shield']
2
Slice from index tuple to single label
>>> df.loc[('cobra', 'mark i'):'viper']
max_speed shield
cobra mark i 12 2
mark ii 0 4
sidewinder mark i 10 20
mark ii 1 4
viper mark ii 7 1
mark iii 16 36
Slice from index tuple to index tuple
>>> df.loc[('cobra', 'mark i'):('viper', 'mark ii')]
max_speed shield
cobra mark i 12 2
mark ii 0 4
sidewinder mark i 10 20
mark ii 1 4
viper mark ii 7 1
Please see the :ref:`user guide<advanced.advanced_hierarchical>`
for more details and explanations of advanced indexing.
In [5]:
df.loc[1]
Out[5]:
one     -0.896467
two      0.323917
three   -0.015165
Name: 1, dtype: float64
In [6]:
#df.loc[df.index == 9, :]  # all columns of the row whose index label is 9; a bare ":" selects the whole axis
df.loc[0:1, :]  # label slice: both 0 and 1 are included
Out[6]:
| | one | two | three |
|---|---|---|---|
| 0 | -1.141389 | -1.597709 | -0.593118 |
| 1 | -0.896467 | 0.323917 | -0.015165 |
In [7]:
df.loc[0:3, ['one', 'two']]
Out[7]:
| | one | two |
|---|---|---|
| 0 | -1.141389 | -1.597709 |
| 1 | -0.896467 | 0.323917 |
| 2 | 1.093343 | -2.020293 |
| 3 | 1.971081 | -0.379402 |
─────────────────────────────────────
In [8]:
print(df.iloc.__doc__)
Purely integer-location based indexing for selection by position.
``.iloc[]`` is primarily integer position based (from ``0`` to
``length-1`` of the axis), but may also be used with a boolean
array.
Allowed inputs are:
- An integer, e.g. ``5``.
- A list or array of integers, e.g. ``[4, 3, 0]``.
- A slice object with ints, e.g. ``1:7``.
- A boolean array.
- A ``callable`` function with one argument (the calling Series or
DataFrame) and that returns valid output for indexing (one of the above).
This is useful in method chains, when you don't have a reference to the
calling object, but would like to base your selection on some value.
- A tuple of row and column indexes. The tuple elements consist of one of the
above inputs, e.g. ``(0, 1)``.
``.iloc`` will raise ``IndexError`` if a requested indexer is
out-of-bounds, except *slice* indexers which allow out-of-bounds
indexing (this conforms with python/numpy *slice* semantics).
See more at :ref:`Selection by Position <indexing.integer>`.
See Also
--------
DataFrame.iat : Fast integer location scalar accessor.
DataFrame.loc : Purely label-location based indexer for selection by label.
Series.iloc : Purely integer-location based indexing for
selection by position.
Examples
--------
>>> mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
... {'a': 100, 'b': 200, 'c': 300, 'd': 400},
... {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
>>> df = pd.DataFrame(mydict)
>>> df
a b c d
0 1 2 3 4
1 100 200 300 400
2 1000 2000 3000 4000
**Indexing just the rows**
With a scalar integer.
>>> type(df.iloc[0])
<class 'pandas.core.series.Series'>
>>> df.iloc[0]
a 1
b 2
c 3
d 4
Name: 0, dtype: int64
With a list of integers.
>>> df.iloc[[0]]
a b c d
0 1 2 3 4
>>> type(df.iloc[[0]])
<class 'pandas.core.frame.DataFrame'>
>>> df.iloc[[0, 1]]
a b c d
0 1 2 3 4
1 100 200 300 400
With a `slice` object.
>>> df.iloc[:3]
a b c d
0 1 2 3 4
1 100 200 300 400
2 1000 2000 3000 4000
With a boolean mask the same length as the index.
>>> df.iloc[[True, False, True]]
a b c d
0 1 2 3 4
2 1000 2000 3000 4000
With a callable, useful in method chains. The `x` passed
to the ``lambda`` is the DataFrame being sliced. This selects
the rows whose index label even.
>>> df.iloc[lambda x: x.index % 2 == 0]
a b c d
0 1 2 3 4
2 1000 2000 3000 4000
**Indexing both axes**
You can mix the indexer types for the index and columns. Use ``:`` to
select the entire axis.
With scalar integers.
>>> df.iloc[0, 1]
2
With lists of integers.
>>> df.iloc[[0, 2], [1, 3]]
b d
0 2 4
2 2000 4000
With `slice` objects.
>>> df.iloc[1:3, 0:3]
a b c
1 100 200 300
2 1000 2000 3000
With a boolean array whose length matches the columns.
>>> df.iloc[:, [True, False, True, False]]
a c
0 1 3
1 100 300
2 1000 3000
With a callable function that expects the Series or DataFrame.
>>> df.iloc[:, lambda df: [0, 2]]
a c
0 1 3
1 100 300
2 1000 3000
In [9]:
df.iloc[2]
Out[9]:
one      1.093343
two     -2.020293
three    0.099026
Name: 2, dtype: float64
In [10]:
df.iloc[1, 2] # value at row position 1, column position 2
Out[10]:
-0.015164940108393593
In [11]:
#df.iloc[:, 0:3]
df.iloc[:3, :2] # rows 0-2 and columns 0-1 (the end position is excluded); same as df.iloc[0:3, 0:2]
Out[11]:
| | one | two |
|---|---|---|
| 0 | -1.141389 | -1.597709 |
| 1 | -0.896467 | 0.323917 |
| 2 | 1.093343 | -2.020293 |
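Note that the two accessors treat slice ends differently: `loc` includes the stop label, while `iloc` follows the usual Python rule and excludes the stop position. A minimal sketch, assuming a hypothetical 4x3 frame `demo` with the default integer index:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with default integer labels 0..3.
demo = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["one", "two", "three"])

label_slice = demo.loc[0:2, "one":"two"]   # labels 0..2 and "one".."two", both ends included
position_slice = demo.iloc[0:2, 0:2]       # positions 0..1 only, end excluded

print(label_slice.shape)     # (3, 2)
print(position_slice.shape)  # (2, 2)
```

This is why `df.loc[0:3, ...]` above returned four rows while `df.iloc[:3, ...]` returned three.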
In [12]:
# Extracting rows with a column condition: the comparison yields a boolean mask
df["two"].values > 0.3
#df[df["two"].values > 0.3]
#df["two"].isin([-0.643932, 0.417865])
Out[12]:
array([False, True, False, False, False, True, True, False, False,
False])
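The boolean array above is only the mask; passing it back to the DataFrame keeps the rows where it is True. A minimal sketch, assuming a hypothetical frame `demo` with fixed values so that the result is reproducible:

```python
import pandas as pd

# Hypothetical fixed data instead of the random values used in the notebook.
demo = pd.DataFrame({"one": [1, 2, 3, 4], "two": [-0.5, 0.4, 0.1, 2.6]})

mask = demo["two"] > 0.3                  # boolean Series aligned with the index
rows = demo[mask]                         # same result as demo.loc[mask]
combined = demo[(demo["two"] > 0.3) & (demo["one"] < 4)]  # parentheses are required around each condition
picked = demo[demo["one"].isin([1, 4])]   # membership test on exact values

print(rows)
```

`isin` compares exact values, so it is better suited to integers or strings than to random floats, which rarely match a rounded literal.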
Handling NaN (missing values)
np.nan represents a missing value. It plays the same role as Null.
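One practical consequence is that NaN cannot be found with `==`, and that it is a float value. A minimal sketch of both points, using a hypothetical Series `s`:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)        # False: NaN never compares equal, even to itself
print(pd.isna(np.nan))         # True: use isna()/isnull() to detect it

s = pd.Series([1, 2, np.nan])  # hypothetical data
print(s.dtype)                 # float64: the integers are upcast so the column can hold NaN
print(s.isna().tolist())       # [False, False, True]
```

This upcast is also why the integers 4 to 10 in the "four" column are displayed as 4.0 to 10.0.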
In [13]:
df["four"] = [np.nan, np.nan, np.nan, 4, 5, 6, 7, 8, 9, 10]
df
Out[13]:
| | one | two | three | four |
|---|---|---|---|---|
| 0 | -1.141389 | -1.597709 | -0.593118 | NaN |
| 1 | -0.896467 | 0.323917 | -0.015165 | NaN |
| 2 | 1.093343 | -2.020293 | 0.099026 | NaN |
| 3 | 1.971081 | -0.379402 | 0.751828 | 4.0 |
| 4 | -0.569713 | -0.061500 | -0.489205 | 5.0 |
| 5 | -0.502000 | 0.545283 | 0.037430 | 6.0 |
| 6 | 1.407763 | 2.647069 | 0.217019 | 7.0 |
| 7 | -1.627089 | -0.195426 | 0.157814 | 8.0 |
| 8 | -1.544826 | -1.033046 | -0.704031 | 9.0 |
| 9 | -0.919720 | -0.968175 | -1.126654 | 10.0 |
In [14]:
df.dropna() # drop every row that contains at least one missing value (returns a new DataFrame; df itself is unchanged)
Out[14]:
| | one | two | three | four |
|---|---|---|---|---|
| 3 | 1.971081 | -0.379402 | 0.751828 | 4.0 |
| 4 | -0.569713 | -0.061500 | -0.489205 | 5.0 |
| 5 | -0.502000 | 0.545283 | 0.037430 | 6.0 |
| 6 | 1.407763 | 2.647069 | 0.217019 | 7.0 |
| 7 | -1.627089 | -0.195426 | 0.157814 | 8.0 |
| 8 | -1.544826 | -1.033046 | -0.704031 | 9.0 |
| 9 | -0.919720 | -0.968175 | -1.126654 | 10.0 |
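`dropna` takes a few parameters that control which rows or columns are removed. A minimal sketch, assuming a hypothetical frame `demo` with NaN in two of its three columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 has one NaN, row 1 has two, row 2 has none.
demo = pd.DataFrame({"one": [1.0, np.nan, 3.0],
                     "two": [0.1, 0.2, 0.3],
                     "four": [np.nan, np.nan, 6.0]})

any_nan = demo.dropna()                   # default how="any": only row 2 survives
by_column = demo.dropna(subset=["one"])   # only column "one" is checked: rows 0 and 2 survive
at_least_two = demo.dropna(thresh=2)      # keep rows with at least 2 non-NaN values
cols = demo.dropna(axis=1)                # drop columns instead of rows: only "two" survives

print(any_nan)
```

None of these calls modify `demo`; assign the result to a variable to keep it.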
In [15]:
df.fillna(value=-1) # replace NaN with -1 (returns a new DataFrame)
Out[15]:
| | one | two | three | four |
|---|---|---|---|---|
| 0 | -1.141389 | -1.597709 | -0.593118 | -1.0 |
| 1 | -0.896467 | 0.323917 | -0.015165 | -1.0 |
| 2 | 1.093343 | -2.020293 | 0.099026 | -1.0 |
| 3 | 1.971081 | -0.379402 | 0.751828 | 4.0 |
| 4 | -0.569713 | -0.061500 | -0.489205 | 5.0 |
| 5 | -0.502000 | 0.545283 | 0.037430 | 6.0 |
| 6 | 1.407763 | 2.647069 | 0.217019 | 7.0 |
| 7 | -1.627089 | -0.195426 | 0.157814 | 8.0 |
| 8 | -1.544826 | -1.033046 | -0.704031 | 9.0 |
| 9 | -0.919720 | -0.968175 | -1.126654 | 10.0 |
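A single constant is not the only option for filling. A minimal sketch of three common variants, assuming a hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN in each column.
demo = pd.DataFrame({"one": [1.0, np.nan, 3.0],
                     "four": [np.nan, 4.0, 6.0]})

per_column = demo.fillna({"one": 0.0, "four": -1.0})  # a different fill value per column
with_mean = demo.fillna(demo.mean())                  # column means (NaN is skipped when averaging)
forward = demo.ffill()                                # carry the last valid value forward

print(with_mean)
```

Forward filling leaves a leading NaN in place, because there is no earlier value to copy.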
In [16]:
df.isnull()
Out[16]:
| | one | two | three | four |
|---|---|---|---|---|
| 0 | False | False | False | True |
| 1 | False | False | False | True |
| 2 | False | False | False | True |
| 3 | False | False | False | False |
| 4 | False | False | False | False |
| 5 | False | False | False | False |
| 6 | False | False | False | False |
| 7 | False | False | False | False |
| 8 | False | False | False | False |
| 9 | False | False | False | False |
In [17]:
df.isnull()["four"]
Out[17]:
0     True
1     True
2     True
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: four, dtype: bool
In [18]:
df.loc[df.isnull()["four"],:]
Out[18]:
| | one | two | three | four |
|---|---|---|---|---|
| 0 | -1.141389 | -1.597709 | -0.593118 | NaN |
| 1 | -0.896467 | 0.323917 | -0.015165 | NaN |
| 2 | 1.093343 | -2.020293 | 0.099026 | NaN |
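Because True counts as 1, the boolean frame returned by `isnull()` can also be summed to count missing values. A minimal sketch, assuming a hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "four" has two missing values.
demo = pd.DataFrame({"one": [1.0, 2.0, 3.0],
                     "four": [np.nan, np.nan, 6.0]})

counts = demo.isnull().sum()                     # number of NaN per column
total = int(demo.isnull().sum().sum())           # number of NaN in the whole frame
complete = demo.loc[demo["four"].notnull(), :]   # the opposite filter: rows where "four" is present

print(counts)
```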
unique
In [19]:
df["four"].unique() # array of the distinct values in a column, in order of first appearance (NaN included)
Out[19]:
array([nan, 4., 5., 6., 7., 8., 9., 10.])
In [30]:
np.unique(df["four"])
Out[30]:
array([ 4., 5., 6., 7., 8., 9., 10., nan])
In [31]:
len(np.unique(df["four"]))
Out[31]:
8
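The two outputs above differ in order: `Series.unique()` keeps the order of first appearance, while `np.unique` returns sorted values. How `np.unique` counts repeated NaNs has differed between NumPy versions, so when only the count is needed, `nunique` is a more explicit choice. A minimal sketch, assuming a hypothetical Series `s`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with repeated values and repeated NaN.
s = pd.Series([np.nan, 4.0, 5.0, 4.0, np.nan])

distinct = s.unique()                        # order of first appearance, NaN kept once
without_nan = s.dropna().unique()            # drop NaN first when it should not count
n_values = s.nunique()                       # NaN is excluded by default
n_with_nan = s.nunique(dropna=False)         # NaN counted as one extra value

print(distinct, n_values, n_with_nan)
```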
drop
In [20]:
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
{'a': 100, 'b': 200, 'c': 300, 'd': 400},
{'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
df_dict = pd.DataFrame(mydict)
df_dict
Out[20]:
| | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 1 | 100 | 200 | 300 | 400 |
| 2 | 1000 | 2000 | 3000 | 4000 |
In [21]:
df_dict = df_dict.drop(0, axis=0)
df_dict
Out[21]:
| | a | b | c | d |
|---|---|---|---|---|
| 1 | 100 | 200 | 300 | 400 |
| 2 | 1000 | 2000 | 3000 | 4000 |
In [22]:
df_dict = df_dict.drop("b", axis=1)
df_dict
Out[22]:
| | a | c | d |
|---|---|---|---|
| 1 | 100 | 300 | 400 |
| 2 | 1000 | 3000 | 4000 |
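`drop` also accepts the `index=` and `columns=` keywords, which read more clearly than `axis=0` and `axis=1`. A minimal sketch, assuming a hypothetical frame `demo`; the label "zzz" is deliberately one that does not exist:

```python
import pandas as pd

# Hypothetical data similar to df_dict above.
demo = pd.DataFrame({"a": [1, 100, 1000], "b": [2, 200, 2000], "c": [3, 300, 3000]})

no_row = demo.drop(index=0)                          # same as demo.drop(0, axis=0)
no_cols = demo.drop(columns=["b", "c"])              # same as demo.drop(["b", "c"], axis=1)
safe = demo.drop(columns=["zzz"], errors="ignore")   # a missing label is skipped instead of raising KeyError

print(no_cols)
```

Like `dropna` and `fillna`, `drop` returns a new DataFrame, which is why the notebook reassigns the result to `df_dict`.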