WEBVTT

1
00:00:00.000 --> 00:00:01.290
In this lesson,

2
00:00:01.290 --> 00:00:04.740
we will learn about data handling and management.

3
00:00:04.740 --> 00:00:07.140
Data handling and management is used

4
00:00:07.140 --> 00:00:10.770
to safeguard data throughout its entire lifecycle

5
00:00:10.770 --> 00:00:15.480
to ensure its confidentiality, integrity, and availability.

6
00:00:15.480 --> 00:00:18.690
Data handling and management concepts include

7
00:00:18.690 --> 00:00:22.380
Data Anonymization and Data Sanitization.

8
00:00:22.380 --> 00:00:25.230
Data Anonymization modifies data

9
00:00:25.230 --> 00:00:29.730
to prevent the identification of individual data owners

10
00:00:29.730 --> 00:00:33.030
while maintaining its ability to be analyzed.

11
00:00:33.030 --> 00:00:35.520
Data Sanitization, on the other hand,

12
00:00:35.520 --> 00:00:38.670
is the process of securely destroying data

13
00:00:38.670 --> 00:00:42.720
to ensure that sensitive information is completely removed

14
00:00:42.720 --> 00:00:46.890
before the storage media is disposed of or repurposed.

15
00:00:46.890 --> 00:00:49.920
Let's learn more about Data Anonymization

16
00:00:49.920 --> 00:00:52.050
and Data Sanitization.

17
00:00:52.050 --> 00:00:55.260
First, we have Data Anonymization.

18
00:00:55.260 --> 00:00:57.480
When dealing with data security,

19
00:00:57.480 --> 00:01:00.390
it's important to protect sensitive information

20
00:01:00.390 --> 00:01:03.360
collected from your users or customers.

21
00:01:03.360 --> 00:01:07.350
One way to achieve this is through Data Anonymization,

22
00:01:07.350 --> 00:01:09.060
which removes the ability

23
00:01:09.060 --> 00:01:12.630
to uniquely identify individuals in data

24
00:01:12.630 --> 00:01:16.380
while still allowing the data to be used for analysis.

25
00:01:16.380 --> 00:01:20.760
Data Anonymization, often called de-identification,

26
00:01:20.760 --> 00:01:24.900
involves stripping out identifying details from the data

27
00:01:24.900 --> 00:01:29.760
before sharing it within or outside of your organization.

28
00:01:29.760 --> 00:01:34.050
For example, if you track how many students pass or fail

29
00:01:34.050 --> 00:01:35.730
a certification exam,

30
00:01:35.730 --> 00:01:39.390
you only need the overall results for analysis,

31
00:01:39.390 --> 00:01:42.120
not individual student names.

32
00:01:42.120 --> 00:01:46.500
In this way, Data Anonymization helps you gather insights

33
00:01:46.500 --> 00:01:49.620
without compromising anyone's privacy.

34
00:01:49.620 --> 00:01:53.700
So Data Anonymization makes sensitive data

35
00:01:53.700 --> 00:01:56.100
usable for other purposes

36
00:01:56.100 --> 00:02:00.030
by removing personal information associated with it.

37
00:02:00.030 --> 00:02:03.810
Common techniques for Data Anonymization include

38
00:02:03.810 --> 00:02:06.180
data masking, tokenization,

39
00:02:06.180 --> 00:02:09.150
as well as aggregation and banding.

40
00:02:09.150 --> 00:02:12.300
Data masking replaces real data

41
00:02:12.300 --> 00:02:14.850
with generic or placeholder values,

42
00:02:14.850 --> 00:02:17.730
preserving the original data's format.

43
00:02:17.730 --> 00:02:21.600
For example, substituting actual credit card numbers

44
00:02:21.600 --> 00:02:25.470
with a series of placeholder digits protects the data

45
00:02:25.470 --> 00:02:27.270
while maintaining its structure,

46
00:02:27.270 --> 00:02:29.880
ensuring sensitive information is hidden

47
00:02:29.880 --> 00:02:32.370
when accessed without a valid reason.

48
00:02:32.370 --> 00:02:35.340
Data masking is not reversible.
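
NOTE A minimal sketch of the data masking described above, not a production implementation. The function name mask_card_number, the "X" placeholder, and the choice to keep the last four digits visible are assumptions made for illustration.
```python
def mask_card_number(card_number, visible=4, placeholder="X"):
    """Replace all but the last `visible` digits, keeping separators and length."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    digits_to_hide = max(total_digits - visible, 0)
    masked = []
    for ch in card_number:
        if ch.isdigit() and digits_to_hide > 0:
            # Swap the real digit for a placeholder; the original is discarded.
            masked.append(placeholder)
            digits_to_hide -= 1
        else:
            # Separators and the trailing visible digits keep the format intact.
            masked.append(ch)
    return "".join(masked)
print(mask_card_number("4111-1111-1111-1111"))  # XXXX-XXXX-XXXX-1111
```
Because the hidden digits are thrown away rather than stored, the masked value cannot be turned back into the original number.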

49
00:02:35.340 --> 00:02:38.730
Tokenization involves replacing sensitive data

50
00:02:38.730 --> 00:02:42.330
with unique tokens that reference the original data

51
00:02:42.330 --> 00:02:44.280
stored somewhere else.

52
00:02:44.280 --> 00:02:48.120
Unlike data masking, tokenization is reversible,

53
00:02:48.120 --> 00:02:51.150
allowing the re-identification of data

54
00:02:51.150 --> 00:02:54.270
when needed for specific business purposes.

55
00:02:54.270 --> 00:02:58.710
For instance, using random IDs for social security numbers

56
00:02:58.710 --> 00:03:02.040
keeps the data safe, but accessible when necessary,

57
00:03:02.040 --> 00:03:05.220
because those random IDs would be mapped

58
00:03:05.220 --> 00:03:08.340
to the original social security numbers.
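
NOTE A minimal sketch of tokenization as described above, assuming a hypothetical in-memory TokenVault class. A real token vault would be a separately secured, access-controlled system; the "tok_" prefix and the sample number (an invalid social security number) are illustrative assumptions.
```python
import secrets
class TokenVault:
    """Hypothetical in-memory vault mapping random tokens to original values."""
    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}
    def tokenize(self, value):
        # Reuse the existing token so one value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it reveals nothing about the original value.
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token
    def detokenize(self, token):
        # Only callers with access to the vault can recover the original.
        return self._token_to_value[token]
vault = TokenVault()
token = vault.tokenize("000-12-3456")
original = vault.detokenize(token)
```
The mapping kept in the vault is what makes tokenization reversible, so the vault itself needs at least as much protection as the original data.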

59
00:03:08.340 --> 00:03:12.420
Next, aggregation and banding involve grouping data

60
00:03:12.420 --> 00:03:15.090
to generalize individual information,

61
00:03:15.090 --> 00:03:19.020
making it difficult to identify specific people.

62
00:03:19.020 --> 00:03:21.750
This technique is often used in studies

63
00:03:21.750 --> 00:03:24.300
where results are reported for groups

64
00:03:24.300 --> 00:03:26.340
rather than individuals.

65
00:03:26.340 --> 00:03:29.970
For example, consider a study on employee salaries

66
00:03:29.970 --> 00:03:31.020
at a company.

67
00:03:31.020 --> 00:03:34.320
Instead of listing each employee's exact salary,

68
00:03:34.320 --> 00:03:36.600
the data could be grouped into ranges

69
00:03:36.600 --> 00:03:40.470
such as $30,000 to $40,000 per year,

70
00:03:40.470 --> 00:03:44.760
or $40,000 to $50,000 per year, and so on.

71
00:03:44.760 --> 00:03:47.400
By reporting that 20 employees earn

72
00:03:47.400 --> 00:03:50.820
between $40,000 and $50,000 per year,

73
00:03:50.820 --> 00:03:55.020
the study hides individual salaries within a broader group.

74
00:03:55.020 --> 00:03:56.700
This makes it much harder

75
00:03:56.700 --> 00:04:00.120
to identify any one person's exact income,

76
00:04:00.120 --> 00:04:02.670
protecting individual privacy,

77
00:04:02.670 --> 00:04:06.000
while still providing useful data insights.
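
NOTE A minimal sketch of the salary banding described above. The helper names salary_band and band_counts and the sample salaries are assumptions for illustration; the bands here are non-overlapping ($40,000 to $49,999) so that each salary falls into exactly one group.
```python
from collections import Counter
def salary_band(salary, width=10_000):
    """Return a label such as '$40,000-$49,999' for the band containing salary."""
    low = (salary // width) * width
    return f"${low:,}-${low + width - 1:,}"
def band_counts(salaries, width=10_000):
    """Count how many salaries fall into each band."""
    return Counter(salary_band(s, width) for s in salaries)
# Made-up sample data; only the per-band counts would be reported.
salaries = [31_500, 42_000, 47_250, 49_999, 38_000]
print(band_counts(salaries))
```
Only the band labels and counts leave the function, so a report built from them never lists an individual's exact salary.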

78
00:04:06.000 --> 00:04:09.990
In the end, Data Anonymization is not foolproof,

79
00:04:09.990 --> 00:04:12.780
and improperly anonymized datasets

80
00:04:12.780 --> 00:04:14.910
can still be re-identified.

81
00:04:14.910 --> 00:04:19.620
Here's a real-world example of some poor Data Anonymization.

82
00:04:19.620 --> 00:04:22.710
I worked at a company that conducted an employee survey

83
00:04:22.710 --> 00:04:25.590
to get some honest feedback about the company.

84
00:04:25.590 --> 00:04:28.860
They made the survey anonymous by not asking for names,

85
00:04:28.860 --> 00:04:31.800
or any directly identifiable information,

86
00:04:31.800 --> 00:04:34.050
hoping that employees would feel comfortable

87
00:04:34.050 --> 00:04:35.640
sharing their thoughts.

88
00:04:35.640 --> 00:04:37.590
The survey included questions like

89
00:04:37.590 --> 00:04:40.290
how they felt about their job, their pay,

90
00:04:40.290 --> 00:04:42.930
and their overall job satisfaction,

91
00:04:42.930 --> 00:04:45.360
along with a few demographic questions

92
00:04:45.360 --> 00:04:47.910
like age, gender, and marital status.

93
00:04:47.910 --> 00:04:49.860
Initially, it seemed harmless,

94
00:04:49.860 --> 00:04:52.830
and the company assumed the data would stay anonymous,

95
00:04:52.830 --> 00:04:54.990
but after reviewing the results,

96
00:04:54.990 --> 00:04:58.500
they noticed a few good ratings and one really bad one,

97
00:04:58.500 --> 00:05:00.570
a one-star rating with a comment saying

98
00:05:00.570 --> 00:05:02.280
the CEO was terrible.

99
00:05:02.280 --> 00:05:04.260
Curious about who left the comment,

100
00:05:04.260 --> 00:05:06.780
the CEO looked at the demographic data.

101
00:05:06.780 --> 00:05:10.980
The person was a married woman between 35 and 40 years old.

102
00:05:10.980 --> 00:05:14.310
In a large company, this could mean dozens of people,

103
00:05:14.310 --> 00:05:16.440
but this company was small,

104
00:05:16.440 --> 00:05:19.170
and only one employee fit that description.

105
00:05:19.170 --> 00:05:20.940
It was the CEO's wife.

106
00:05:20.940 --> 00:05:23.070
She often joked around in surveys

107
00:05:23.070 --> 00:05:25.080
knowing she would not get fired.

108
00:05:25.080 --> 00:05:28.500
The lesson here is that even with aggregation and banding,

109
00:05:28.500 --> 00:05:31.890
small groups can make re-identification easy.

110
00:05:31.890 --> 00:05:34.080
If the dataset isn't large enough,

111
00:05:34.080 --> 00:05:35.970
anonymization may fail,

112
00:05:35.970 --> 00:05:38.580
and people can still be identified.
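
NOTE A minimal sketch of one way to catch the small-group problem from the survey story before results are shared, similar in spirit to a k-anonymity check. The function name risky_groups, the field names, and the min_group_size threshold are assumptions for illustration.
```python
from collections import Counter
def risky_groups(records, quasi_identifiers, min_group_size=5):
    """Return attribute combinations shared by fewer than min_group_size records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < min_group_size}
# Made-up survey responses with banded ages.
responses = [
    {"age_band": "35-40", "gender": "F", "marital_status": "married"},
    {"age_band": "25-30", "gender": "M", "marital_status": "single"},
    {"age_band": "25-30", "gender": "M", "marital_status": "single"},
]
fields = ["age_band", "gender", "marital_status"]
print(risky_groups(responses, fields, min_group_size=2))
```
Any combination this returns describes so few people that its answers could be traced to an individual, so those rows would need wider bands or fewer demographic fields before release.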

113
00:05:38.580 --> 00:05:41.640
Second, we have Data Sanitization.

114
00:05:41.640 --> 00:05:45.510
Data Sanitization ensures that data is completely removed

115
00:05:45.510 --> 00:05:46.920
from the storage media

116
00:05:46.920 --> 00:05:51.090
before it is disposed of, repurposed, or transferred.

117
00:05:51.090 --> 00:05:52.920
This process is critical

118
00:05:52.920 --> 00:05:55.740
for maintaining data confidentiality,

119
00:05:55.740 --> 00:05:59.640
and preventing unauthorized access to sensitive information

120
00:05:59.640 --> 00:06:02.190
that might otherwise be recoverable.

121
00:06:02.190 --> 00:06:04.230
Imagine a company has decided

122
00:06:04.230 --> 00:06:06.780
to upgrade its computer systems.

123
00:06:06.780 --> 00:06:10.110
This upgrade includes replacing old hard drives

124
00:06:10.110 --> 00:06:12.930
used to store sensitive customer information,

125
00:06:12.930 --> 00:06:16.260
financial records, and other internal documents.

126
00:06:16.260 --> 00:06:19.980
The company needs to dispose of these drives securely

127
00:06:19.980 --> 00:06:23.670
to prevent any unauthorized access to the data.

128
00:06:23.670 --> 00:06:26.580
Simply deleting the files or formatting the drives

129
00:06:26.580 --> 00:06:29.310
might seem like enough, but it's not.

130
00:06:29.310 --> 00:06:33.720
Deleted files are often just marked as free space,

131
00:06:33.720 --> 00:06:37.830
and a quick format only removes the file system references

132
00:06:37.830 --> 00:06:41.040
without actually erasing the underlying data.

133
00:06:41.040 --> 00:06:43.080
With the right recovery tools,

134
00:06:43.080 --> 00:06:44.910
a hacker could still retrieve

135
00:06:44.910 --> 00:06:49.170
valuable and confidential information from these drives.

136
00:06:49.170 --> 00:06:50.730
To solve this problem,

137
00:06:50.730 --> 00:06:53.880
the company implements Data Sanitization methods

138
00:06:53.880 --> 00:06:56.820
to ensure that all data is permanently erased,

139
00:06:56.820 --> 00:06:58.950
and cannot be retrieved.

140
00:06:58.950 --> 00:07:00.810
One of the most common methods

141
00:07:00.810 --> 00:07:03.570
they would use is overwriting.

142
00:07:03.570 --> 00:07:07.230
In this process, the data on hard drives is overwritten

143
00:07:07.230 --> 00:07:10.110
with zeros, ones, or random patterns,

144
00:07:10.110 --> 00:07:13.020
often three, seven, or even more times.

145
00:07:13.020 --> 00:07:15.810
This process replaces the original data

146
00:07:15.810 --> 00:07:17.760
with meaningless information,

147
00:07:17.760 --> 00:07:20.940
effectively destroying any traces of the files

148
00:07:20.940 --> 00:07:22.950
that were originally stored.
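
NOTE A minimal sketch of multi-pass overwriting applied to a single file, for illustration only. The function name overwrite_file and its parameters are assumptions; on solid-state drives and on journaling or copy-on-write file systems, writing to a file may not touch the physical blocks that held the old data, so this is not a substitute for a dedicated sanitization tool.
```python
import os
def overwrite_file(path, passes=3, delete=True, chunk_size=1024 * 1024):
    """Overwrite a file in place with random bytes `passes` times, then optionally delete it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:
                # Write in chunks so large files do not need to fit in memory.
                n = min(chunk_size, remaining)
                f.write(os.urandom(n))
                remaining -= n
            # Push this pass to the storage device before starting the next one.
            f.flush()
            os.fsync(f.fileno())
    if delete:
        os.remove(path)
```
A call such as overwrite_file("old_records.csv", passes=3), where the file name is hypothetical, would replace the file's bytes with random data three times before removing it.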

149
00:07:22.950 --> 00:07:26.820
Overwriting does not have to be done by hand.

150
00:07:26.820 --> 00:07:31.020
There are tools available for secure Data Sanitization.

151
00:07:31.020 --> 00:07:35.670
One popular tool is Microsoft's Sysinternals SDelete,

152
00:07:35.670 --> 00:07:37.830
which securely destroys files

153
00:07:37.830 --> 00:07:40.470
by overwriting them with random data.

154
00:07:40.470 --> 00:07:42.660
For instance, when a company needs

155
00:07:42.660 --> 00:07:45.060
to securely delete customer records

156
00:07:45.060 --> 00:07:47.310
before recycling a computer,

157
00:07:47.310 --> 00:07:49.530
SDelete can be used to make sure

158
00:07:49.530 --> 00:07:53.490
that no sensitive information remains on the device.

159
00:07:53.490 --> 00:07:57.270
Another method of sanitization is Degaussing.

160
00:07:57.270 --> 00:08:00.510
Degaussing uses strong magnetic fields

161
00:08:00.510 --> 00:08:04.020
to disrupt the data stored on magnetic media,

162
00:08:04.020 --> 00:08:05.940
like hard drives or tapes,

163
00:08:05.940 --> 00:08:08.550
effectively scrambling the information,

164
00:08:08.550 --> 00:08:10.920
so it cannot be read again.

165
00:08:10.920 --> 00:08:13.770
Degaussing only works on magnetic media,

166
00:08:13.770 --> 00:08:16.890
so it will not work on a solid-state drive.

167
00:08:16.890 --> 00:08:20.430
Also, degaussing permanently destroys the media.

168
00:08:20.430 --> 00:08:22.260
It can't be used again.

169
00:08:22.260 --> 00:08:26.460
Physical destruction is another form of Data Sanitization.

170
00:08:26.460 --> 00:08:29.430
This method involves shredding, crushing,

171
00:08:29.430 --> 00:08:32.160
or incinerating storage devices

172
00:08:32.160 --> 00:08:35.910
to make data recovery and reuse impossible.

173
00:08:35.910 --> 00:08:39.120
For example, when banks need to dispose

174
00:08:39.120 --> 00:08:42.630
of old servers containing sensitive financial data,

175
00:08:42.630 --> 00:08:45.180
they often use industrial shredders

176
00:08:45.180 --> 00:08:49.290
to break down the hard drives into tiny pieces.

177
00:08:49.290 --> 00:08:51.510
By physically destroying the drives,

178
00:08:51.510 --> 00:08:55.650
they eliminate any chance of the data being reconstructed.

179
00:08:55.650 --> 00:08:59.640
Data Sanitization is critical not only for businesses,

180
00:08:59.640 --> 00:09:01.800
but also for individuals

181
00:09:01.800 --> 00:09:04.740
who want to protect their personal information.

182
00:09:04.740 --> 00:09:09.390
For example, before selling or donating an old smartphone,

183
00:09:09.390 --> 00:09:12.570
it's important to conduct a factory reset,

184
00:09:12.570 --> 00:09:15.480
and then run a Data Sanitization tool

185
00:09:15.480 --> 00:09:19.350
to ensure that no personal photos, contacts, or messages

186
00:09:19.350 --> 00:09:21.300
remain on the device.

187
00:09:21.300 --> 00:09:24.990
This prevents anyone from accessing sensitive information

188
00:09:24.990 --> 00:09:27.600
after the phone has changed hands.

189
00:09:27.600 --> 00:09:31.440
So remember, data handling and management

190
00:09:31.440 --> 00:09:34.830
should protect data throughout its entire lifecycle,

191
00:09:34.830 --> 00:09:39.720
and ensure confidentiality, integrity, and availability.

192
00:09:39.720 --> 00:09:42.840
Data handling and management concepts include

193
00:09:42.840 --> 00:09:46.830
Data Anonymization and Data Sanitization.

194
00:09:46.830 --> 00:09:49.860
Data Anonymization modifies data

195
00:09:49.860 --> 00:09:54.150
to prevent the identification of individual data owners

196
00:09:54.150 --> 00:09:57.150
while still allowing the data to be analyzed,

197
00:09:57.150 --> 00:10:00.900
making it valuable without compromising privacy.

198
00:10:00.900 --> 00:10:03.540
Data Sanitization, on the other hand,

199
00:10:03.540 --> 00:10:06.270
involves securely destroying data

200
00:10:06.270 --> 00:10:10.410
to ensure that sensitive information is completely removed

201
00:10:10.410 --> 00:10:12.990
before storage media is disposed of,

202
00:10:12.990 --> 00:10:15.270
repurposed, or transferred.

203
00:10:15.270 --> 00:10:20.190
Together, these processes help protect sensitive information,

204
00:10:20.190 --> 00:10:23.970
maintain privacy, and ensure data security

205
00:10:23.970 --> 00:10:26.253
in enterprise environments.

