Commit a8579ab

Examples p1 (ssciwr#4)
* format readme
* update main example
* more examples
* flake8 other errors example
* example 3 style
* formatter examples
1 parent 524da27 commit a8579ab

8 files changed: +341 -8 lines changed

Diff for: Material_Part1_PEP/PEP_right_or_wrong

+55 -1
@@ -1,8 +1,62 @@
+# Code alignment
+
+Which of the below alignments are correct?
+
+[]
+```
+abs_area = area_A + area_B +
+area_C + area_D
+```
+[]
+```
+abs_area = area_A + area_B
++ area_C + area_D
+```
+
+[]
+```
+result = my_function(area_A, area_B,
+area_C, area_D)
+```
+[]
+```
+result = my_function(area_A, area_B,
+area_C, area_D)
+```
+[]
+```
+result = my_function(
+area_A, area_B,
+area_C, area_D
+)
+```
+
 # Naming conventions
 
 Which of the below naming conventions are correct?
 
 - [ ] `class my-first-analysis:`
 - [ ] `class my_first_analysis:`
 - [ ] `class Myfirstanalysis:`
-- [ ] `class MyFirstAnalysis:`
+- [ ] `class MyFirstAnalysis:`
+
+- [ ] `def calc_area(x):`
+- [ ] `def calc-area(x):`
+- [ ] `def calcarea(x):`
+- [ ] `def Calc_area(x):`
+
+- [ ] `O = abs(x)`
+- [ ] `I = abs(x)`
+- [ ] `l = abs(x)`
+- [ ] `abs_x = abs(x)`
+
+- [ ] `THRESHOLD = 0.1`
+- [ ] `threshold = 0.1`
+- [ ] `Threshold = 0.1`
+- [ ] `T = 0.1`
+
+- [ ] ` list = my_areas`
+- [ ] ` list_ = my_areas`
+- [ ] ` __list__ = my_areas`
+- [ ] ` _list = my_areas`
+
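As a point of reference for the quiz items above, here is a minimal sketch of names and continuation lines that follow the PEP 8 recommendations. The quiz file does not say which options are meant to be correct, so this is one reading of PEP 8, not an answer key. All identifiers (`sum_areas`, the lowercase `area_a` to `area_d`, the toy `calc_area` method) are hypothetical.

```python
# Illustrative sketch only: all names below are hypothetical.

THRESHOLD = 0.1  # module-level constant: UPPER_CASE_WITH_UNDERSCORES


class MyFirstAnalysis:  # class name: CapWords
    """Toy class used to show naming conventions."""

    def calc_area(self, width, height):  # method: lower_case_with_underscores
        return width * height


def sum_areas(area_a, area_b, area_c, area_d):
    # Wrap the expression in parentheses and break before the binary operator.
    abs_area = (
        area_a + area_b
        + area_c + area_d
    )
    return abs_area


# Continuation lines aligned with the opening delimiter.
result = sum_areas(1.0, 2.0,
                   3.0, 4.0)

# Hanging indent: nothing follows the opening parenthesis on the first line.
same_result = sum_areas(
    1.0, 2.0,
    3.0, 4.0,
)

list_ = [result, same_result]  # trailing underscore avoids shadowing list
abs_x = abs(-2.5)  # descriptive name instead of the ambiguous l, O or I
```

Note that a bare line break in the middle of an expression is a syntax error in Python; the parentheses around the sum are what make the multi-line form valid.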

Diff for: Material_Part1_PEP/README.md

+3
@@ -134,6 +134,9 @@ Make use of indentation when using continuation lines:
 - use two leading underscores to invoke name mangling for attributes that should not be used in subclasses of the parent class (`__only_parent`)
 - double leading and trailing underscores for "magic" objects (dunder methods) - `__init__`, `__str__`
 
+**Task 1: Let's take a look at some [examples](./PEP_right_or_wrong).**
+
+**Task 2: Work through the examples [in this folder](.). Correct the issues and (i) stage, commit and push the changes to your fork of the course repo, then open a pull request with respect to the original repository - then I can see the changes. (ii) Send me your changed files via email.**
 
 ## What is PEP 257?
 
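The two underscore conventions quoted in the context lines of this hunk can be tried out with a short sketch. The classes `Parent` and `Child` are hypothetical; only the attribute name `__only_parent` comes from the README text.

```python
class Parent:
    def __init__(self):
        # Two leading underscores: stored as _Parent__only_parent.
        self.__only_parent = "parent value"

    def __str__(self):  # dunder ("magic") method used by str() and print()
        return f"Parent({self.__only_parent})"


class Child(Parent):
    def __init__(self):
        super().__init__()
        # Mangled separately as _Child__only_parent, so the parent's
        # attribute is not overwritten by accident.
        self.__only_parent = "child value"


child = Child()
parent_value = child._Parent__only_parent
child_value = child._Child__only_parent
```

Name mangling is a guard against accidental clashes in subclasses rather than access control: the mangled names remain reachable from outside, as the last two lines show.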

Diff for: Material_Part1_PEP/example1.py

+34
@@ -0,0 +1,34 @@
import os
import glob



# find all png files in a folder
def find_files(path=None, pattern="*.png", recursive=True, limit = 20) -> list:
    """Find image files on the file system

    :param path:
        The base directory where we are looking for the images. Defaults to None, which uses the XDG data directory if set or the current working directory otherwise.
    :param pattern:
        The naming pattern that the filename should match. Defaults to
        "*.png". Can be used to allow other patterns or to only include
        specific prefixes or suffixes.
    :param recursive:
        Whether to recurse into subdirectories.
    :param limit:
        The maximum number of images to be found. Defaults to 20.
        To return all images, set to None.
    """
    if path is None:
        path = os.environ.get("XDG_DATA_HOME", ".")

    result=list(glob.glob(f"{path}/{pattern}", recursive=recursive))

    if limit is not None:
        result = result[:limit]

    return result

if __name__=="__main__":
    list = find_files(path="./data/")
    print("Found files {}".format(list))
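Beyond the style issues, `example1.py` has a behaviour that a style checker will not point out: in the standard `glob` module, `recursive=True` only changes how the `**` wildcard is matched, so with the pattern `"*.png"` the search stays in the top-level folder. The following is one possible cleaned-up version, a sketch that assumes the intent was to descend into subdirectories. The `**` path component, the `sorted()` call and the renamed variables are additions, not part of the commit.

```python
import glob
import os


def find_files(path=None, pattern="*.png", recursive=True, limit=20):
    """Return up to `limit` paths below `path` that match `pattern`."""
    if path is None:
        path = os.environ.get("XDG_DATA_HOME", ".")

    if recursive:
        # "**" matches zero or more directory levels when recursive=True.
        search_pattern = os.path.join(path, "**", pattern)
    else:
        search_pattern = os.path.join(path, pattern)

    # sorted() is an addition here to make the result order deterministic.
    found_files = sorted(glob.glob(search_pattern, recursive=recursive))

    if limit is not None:
        found_files = found_files[:limit]
    return found_files


if __name__ == "__main__":
    print("Found files {}".format(find_files(path="./data/")))
```

Renaming the result variable also avoids rebinding the built-in `list`, which the original does in its `__main__` block.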

Diff for: Material_Part1_PEP/example2.py

+19
@@ -0,0 +1,19 @@
import numpy as np

def area_circ(r_in ):
    """Calculates the area of a circle with given radius.

    :Input: The radius of the circle (float, >=0).
    :Returns: The area of the circle (float)."""
    if r_in<0:
        raise ValueError("The radius must be >= 0.")
    Kreis=np.pi*r_in**2
    print(
        """The area of a circle with radius r = {:3.2f}cm is A = {:4.2f}cm2.""".format(
            r_in,Kreis
        )
    )
    return Kreis

if __name__ == "__main__":
    _ = area_circ(5.0)
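For comparison, here is one possible tidied version of `area_circ`, offered as a sketch, not as the course's reference solution. It assumes that swapping `numpy` for the standard-library `math` module is acceptable for a scalar radius, and it moves the printing out of the function. The names `radius` and `area` are my own choices.

```python
import math


def area_circ(radius):
    """Return the area of a circle with the given radius (float, >= 0)."""
    if radius < 0:
        raise ValueError("The radius must be >= 0.")
    return math.pi * radius**2


if __name__ == "__main__":
    radius = 5.0
    area = area_circ(radius)
    print(
        f"The area of a circle with radius r = {radius:3.2f}cm "
        f"is A = {area:4.2f}cm2."
    )
```

Keeping the calculation free of side effects makes the function easier to test, since callers decide whether and how to report the result.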

Diff for: Material_Part1_PEP/example3.py

+25
@@ -0,0 +1,25 @@
def validate_data_dict(data_dict):
    if not data_dict:
        raise ValueError("data_dict is empty")
    for something, otherthing in data_dict.items():
        if not otherthing:
            raise ValueError(f"The dict content under {something} is empty.")
        if not isinstance(otherthing, dict):
            raise ValueError(
                f"The content of {something} is not a dict but {type(otherthing)}."
            )

        list = ["data", "file_type", "sofa", "paragraph"]
        missing_cats = []
        for category in list:
            if category not in list(otherthing.keys()):
                missing_cats.append(category)

        if missing_cats:
            raise ValueError(f"Data dict is missing categories: {missing_cats}")



if __name__ == "__main__":
    data_dict = {}
    data_dict = {"test": {"testing": "just testing"}}
    validate_data_dict(data_dict)
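The variable named `list` in this example is more than a style problem: once the name is rebound to a list object, the later call `list(otherthing.keys())` can no longer reach the built-in and raises a `TypeError`. The sketch below reproduces that failure in isolation and shows one way to avoid it. The helper names `shadowing_demo`, `missing_categories` and `REQUIRED_CATEGORIES` are hypothetical.

```python
REQUIRED_CATEGORIES = ("data", "file_type", "sofa", "paragraph")


def shadowing_demo():
    """Reproduce the failure: a local named `list` hides the built-in."""
    list = ["data", "file_type"]  # deliberately shadows the built-in
    try:
        list({"data": 1}.keys())  # calls the local list object instead
    except TypeError as error:
        return str(error)
    return None


def missing_categories(entry, required=REQUIRED_CATEGORIES):
    """Return the required keys that are absent from the dict `entry`."""
    # `in` works on a dict directly, so no list(...) conversion is needed.
    return [category for category in required if category not in entry]
```

With the sample input from the `__main__` block, `missing_categories({"testing": "just testing"})` reports all four categories as missing, which is the check the original function was trying to reach.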

Diff for: Material_Part2_Linter/example3.py

+184
@@ -0,0 +1,184 @@

map_expressions = {
    "KAT1MoralisierendesSegment": "KAT1-Moralisierendes Segment",
    "Moralwerte": "KAT2-Moralwerte",
    "KAT2Subjektive_Ausdrcke": "KAT2-Subjektive Ausdrücke",
    "Protagonistinnen2": "KAT3-Gruppe",
    "Protagonistinnen": "KAT3-Rolle",
    "Protagonistinnen3": "KAT3-own/other",
    "KommunikativeFunktion": "KAT4-Kommunikative Funktion",
    "Forderung": "KAT5-Forderung explizit",
    "KAT5Ausformulierung": "KAT5-Forderung implizit",
    "Kommentar": "KOMMENTAR",
}

def validate_data_dict(data_dict):
    if not data_dict:
        raise ValueError("data_dict is empty")
    for data_file_name, data_file in data_dict.items():
        validation_list = ["data", "file_type", "sofa", "paragraph"]
        missing_cats = []
        for category in validation_list:
            if category not in list(data_file.keys()):
                missing_cats.append(category)

        if missing_cats:
            raise ValueError(f"Data dict is missing categories: {missing_cats}")


class AnalyseOccurrence:
    """Contains statistical information methods about the data."""

    def __init__(
        self,
        data_dict: dict,
        mode: str = "instances",
        file_names: str = None,
    ) -> None:

        validate_data_dict(data_dict)

        self.mode = mode
        self.data_dict = data_dict
        self.mode_dict = {
            "instances": self.report_instances,
            "spans": self.report_spans,
            "span_index": self.report_index,
        }
        self.file_names = self._initialize_files(file_names)
        self.instance_dict = self._initialize_dict()
        # call the analysis method
        self.mode_dict[self.mode]()
        # map the df columns to the expressions given
        self.map_categories()

    def _initialize_files(self, file_names: str) -> list:
        """Helper method to get file names in list."""
        # get the file names from the global dict of dicts
        if file_names is None:
            file_names = list(self.data_dict.keys())
        # or use the file names that were passed explicitly
        elif isinstance(file_names, str):
            file_names = [file_names]
        return file_names

    def _initialize_dict(self) -> defaultdict:
        """Helper method to initialize dict."""
        return defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def _initialize_df(self):
        """Helper method to initialize data frame."""
        self.df = pd.DataFrame(self.instance_dict)
        self.df.index = self.df.index.set_names((["Main Category", "Sub Category"]))

    def _get_categories(self, span_dict, file_name):
        """Helper method to initialize a dict with the given main and sub categories."""
        for main_cat_key, main_cat_value in span_dict.items():
            for sub_cat_key, sub_cat_value in main_cat_value.items():
                # the tuple index makes it easy to convert the dict into a pandas dataframe
                self.instance_dict[file_name][(main_cat_key, sub_cat_key)] = len(
                    sub_cat_value
                )
        return self.instance_dict

    def _add_total(self):
        """Helper method to set additional headers in data frame."""
        self.df.loc[("total instances", "with invalid"), :] = self.df.sum(axis=0).values
        self.df.loc[("total instances", "without invalid"), :] = (
            self.df.loc[("total instances", "with invalid"), :].values
            - self.df.loc["KAT1MoralisierendesSegment", "Keine Moralisierung"].values
        )

    def _clean_df(self):
        """Helper method to sort data frame and clean up values."""
        self.df = self.df.sort_values(
            by=[
                "Main Category",
                "Sub Category",
                # self.file_names[0],
            ],
            ascending=True,
        )
        # fill NaN with 0 for instances or None for spans
        if self.mode == "instances":
            self.df = self.df.fillna(0)
        if self.mode == "spans":
            self.df = self.df.replace({np.nan: None})
        # remove quotes - not sure if this is necessary
        # self.df = self.df.applymap(lambda x: x.replace('"','') if isinstance(x, str) else x)

    def report_instances(self):
        """Reports number of occurrences of a category per text source."""
        # instances reports the number of occurrences
        # filename: main_cat: sub_cat: instances
        for file_name in self.file_names:
            span_dict = self.data_dict[file_name]["data"]
            # initilize total instances rows for easier setting later.
            # only for mode instances
            self.instance_dict[file_name][("total instances", "with invalid")] = 0
            self.instance_dict[file_name][("total instances", "without invalid")] = 0
            self.instance_dict = self._get_categories(span_dict, file_name)
        # initialize data frame
        self._initialize_df()
        # add rows for total instances
        # only do this for mode instances
        self._add_total()

    def report_spans(self):
        """Reports spans of a category per text source."""
        # span reports the spans of the annotations separated by separator-token
        self.instance_dict = self._get_categories(
            self.data_dict[self.file_names[0]]["data"], self.file_names[0]
        )
        self._initialize_df()
        self.df[:] = self.df[:].astype("object")
        for file_name in self.file_names:
            span_dict = self.data_dict[file_name]["data"]
            span_text = self.data_dict[file_name]["sofa"]
            for main_cat_key, main_cat_value in span_dict.items():
                for sub_cat_key in main_cat_value.keys():
                    # save the span begin and end character index for further analysis
                    # span_dict[main_cat_key][sub_cat_key] =
                    # find the text for each span
                    span_annotated_text = [
                        span_text[span["begin"] : span["end"]]
                        for span in span_dict[main_cat_key][sub_cat_key]
                    ]
                    # clean the spans from #
                    span_annotated_text = [
                        span.replace("#", "") for span in span_annotated_text
                    ]
                    # clean the spans from "
                    # span_annotated_text = [
                    # span.replace('"', "") for span in span_annotated_text
                    # ]
                    # convert list to &-separated spans
                    span_annotated_text = " & ".join(span_annotated_text)
                    self.df.at[
                        (main_cat_key, sub_cat_key),
                        file_name,
                    ] = span_annotated_text

    def report_index(self):
        self.report_instances()
        self.df[:] = self.df[:].astype("object")
        for file_name in self.file_names:
            span_dict = self.data_dict[file_name]["data"]
            for main_cat_key, main_cat_value in span_dict.items():
                for sub_cat_key in main_cat_value.keys():
                    # report the beginning and end of each span as a tuple
                    span_list = [
                        (span["begin"], span["end"])
                        for span in span_dict[main_cat_key][sub_cat_key]
                    ]
                    self.df.at[
                        (main_cat_key, sub_cat_key),
                        file_name,
                    ] = span_list

    def map_categories(self):
        self.df = self.df.rename(map_expressions)
        self._clean_df()



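The file above uses `defaultdict`, `pd` and `np` without importing them. flake8 reports such names as undefined (code F821), which fits the "flake8 other errors example" entry in the commit message. The counting step in `_get_categories` can be tried without pandas. The sketch below is stdlib-only and rests on assumptions: `count_spans` and the toy input are hypothetical, and the nesting is reduced to the two levels the method actually writes to.

```python
from collections import defaultdict


def count_spans(data_dict):
    """Count annotated spans per file and (main category, sub category)."""
    counts = defaultdict(lambda: defaultdict(int))
    for file_name, data_file in data_dict.items():
        for main_cat, sub_cats in data_file["data"].items():
            for sub_cat, spans in sub_cats.items():
                # Tuple keys map onto a two-level row index later on.
                counts[file_name][(main_cat, sub_cat)] = len(spans)
    return counts


# Hypothetical toy input in the shape the class appears to expect.
toy_data = {
    "file_a": {
        "data": {
            "Forderung": {
                "explizit": [{"begin": 0, "end": 4}, {"begin": 10, "end": 15}],
                "implizit": [],
            }
        }
    }
}
counts = count_spans(toy_data)
```

Because the inner mapping is a `defaultdict(int)`, looking up a category pair that never occurred yields 0 instead of raising `KeyError`, which is convenient when files do not share all categories.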

Diff for: Material_Part3_Formatter/example2.py

+19
@@ -0,0 +1,19 @@
import numpy as np

def area_circ(r_in):
    """Calculates the area of a circle with given radius.

    :Input: The radius of the circle (float, >=0).
    :Returns: The area of the circle (float)."""
    if r_in<0:
        raise ValueError("The radius must be >= 0.")
    Kreis=np.pi*r_in**2
    print(
        """The area of a circle with radius r = {:3.2f}cm is A = {:4.2f}cm2.""".format(
            r_in,Kreis
        )
    )
    return Kreis

if __name__ == "__main__":
    _ = area_circ(5.0)

Diff for: README.md

+2 -7
@@ -6,12 +6,7 @@ Material for the course "Python best practices", Scientific Software Center, Hei
 
 Inga Ulusoy, October 2022
 
-Python has rapidly advanced to the most popular programming language in science and
-research. From data analysis to simulation and preparation of publications, all can be done in
-Python with appropriate libraries and implementing own modules. We will discuss Python
-Enhancement Proposals (PEP) and how these can help you write cleaner code. Common
-pitfalls in Python will be explained with examples. We will demonstrate typical “bad
-programming” and how to code the examples in a more pythonic way.
+Python has rapidly advanced to the most popular programming language in science and research. From data analysis to simulation and preparation of publications, all can be done in Python with appropriate libraries and implementing own modules. We will discuss most important Python Enhancement Proposals (PEP) and how these can help you write cleaner code. You will learn how to use a code linter and code formatter. Common pitfalls in Python will be explained with examples. We will demonstrate typical “bad programming” and how to code the examples in a more pythonic way.
 
 ## Prerequisites
 Basic Python knowledge is required. Participants need a laptop/PC with camera and
@@ -33,6 +28,6 @@ Course date: Nov 8th 2022, 9:00AM - 1:00PM
 
 1. [PEP recommendations](Material_Part1_PEP/README.md)
 1. [Linting](Material_Part2_Linter/README.md)
-1. Code formatting
+1. [Code formatting](Material_Part3_Formatter/README.md)
 1. Write better code: Examples
 1. Write better code: Pitfalls
