The problem
In a project at FoKoS we are trying to analyze Twitter messages on the topic of "fracking" (a controversially debated extraction technique for natural gas). There are roughly 20,000 such tweets per week. We would like to know which messages are for or against fracking. Morry tried to determine this with a sentiment analysis and describes the method below:
Measuring semantics of a sentence or expression…
The idea comes from the availability of the WordNet dictionary [i]. In this dictionary the entry for each adjective lists its synonyms, so the distance between two adjectives can be measured as the number of synonym steps needed to get from one to the other. The more similar two adjectives are, the smaller the distance between them is expected to be. For example, the distance between “honest” and “good” is just 2, but the distance between “honest” and “bad” is 6.
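The distance described here is a shortest path in a graph whose edges are synonym links. A minimal sketch, assuming a tiny hand-made graph: the name TOY_SYNONYMS and its edges are illustrative only and are not taken from WordNet; they are merely chosen so that the two distances quoted above come out.

```python
from collections import deque

# Illustrative synonym links only; NOT the actual WordNet entries.
TOY_SYNONYMS = {
    "honest": ["honorable"],
    "honorable": ["honest", "good"],
    "good": ["honorable", "sound"],
    "sound": ["good", "heavy"],
    "heavy": ["sound", "big"],
    "big": ["heavy", "bad"],
    "bad": ["big"],
}


def synonym_distance(graph, start, goal):
    """Return the number of synonym steps on the shortest path, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    # Breadth-first search: the first time we reach `goal` is the shortest path.
    while queue:
        word, steps = queue.popleft()
        for neighbour in graph.get(word, []):
            if neighbour == goal:
                return steps + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, steps + 1))
    return None  # the two words are not connected


print(synonym_distance(TOY_SYNONYMS, "honest", "good"))  # 2
print(synonym_distance(TOY_SYNONYMS, "honest", "bad"))   # 6
```

With the real dictionary, the graph would be built from WordNet's synonym relations instead of this toy table; the search itself stays the same.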
Jaap Kamps [ii] developed this idea to measure the positivity (EVA), potency (POT) and activeness (ACT) of an adjective as below:
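Written out following the definitions published by Kamps et al. [ii], the three measures take the form below. This is a reconstruction rather than a copy from this post, and the symbol w for the adjective being scored is our notation.

```latex
\begin{align*}
\mathrm{EVA}(w) &= \frac{d(w, \text{bad}) - d(w, \text{good})}{d(\text{good}, \text{bad})} \\
\mathrm{POT}(w) &= \frac{d(w, \text{weak}) - d(w, \text{strong})}{d(\text{strong}, \text{weak})} \\
\mathrm{ACT}(w) &= \frac{d(w, \text{passive}) - d(w, \text{active})}{d(\text{active}, \text{passive})}
\end{align*}
```

With the distances quoted above for “honest” (2 to “good”, 6 to “bad”), and assuming d(good, bad) = 4, this gives EVA(honest) = (6 - 2) / 4 = 1.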
Here, d(word1, word2) stands for the distance between two words in the WordNet dictionary. These measurements range from -1 to +1: values near +1 mean the adjective is close to good, strong, or active, and values near -1 mean it is close to bad, weak, or passive.
We extend this idea to measure the semantics of whole expressions. An expression consists of several words; we evaluate every word in the expression and take the average, which gives an algorithmic score for the expression. Take these two samples:
“My life is nice and pretty”
“I have always been unlucky and sinister”
The drawback of this method is that it cannot capture negation, as in “I have never been unlucky and sinister”.
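The averaging step and its blind spot for negation can be sketched as follows. The scores in TOY_EVA are hypothetical placeholders, not values computed from WordNet, and averaging only over the words that have a score is our assumption about how unscored words are handled.

```python
import re

# Hypothetical EVA scores for a few adjectives; real values would come from
# the WordNet distances described above.
TOY_EVA = {"nice": 1.0, "pretty": 0.5, "unlucky": -0.75, "sinister": -1.0}


def expression_score(text, lexicon):
    """Average the scores of all words in `text` that appear in `lexicon`."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    # Expressions without any scored word get no score at all.
    return sum(scores) / len(scores) if scores else None


print(expression_score("My life is nice and pretty", TOY_EVA))               # 0.75
print(expression_score("I have always been unlucky and sinister", TOY_EVA))  # -0.875
print(expression_score("I have never been unlucky and sinister", TOY_EVA))   # -0.875
```

The last two sentences receive the same score because “always” and “never” carry no score of their own, so the negation has no effect on the average.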
We tested whether this method is useful for tagging Tweets about “fracking”. We tagged 500 Tweets manually and then checked whether the method's scores agree with our tags. We separated negatively tagged Tweets (Tweets opposing fracking) from positively tagged Tweets (Tweets supporting fracking) and plotted the distributions of the three measurements for each group. We expected to find a significant difference between the two groups. As the plots below show, we did not find one: the Tweets are more complex than the sample sentences, and they contain ambiguous expressions and negations which this method cannot distinguish.
ACT plot of positively tagged Tweets and negatively tagged Tweets
EVA plot of positively tagged Tweets and negatively tagged Tweets
POT plot of positively tagged Tweets and negatively tagged Tweets
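A visual comparison like this can be complemented by a simple significance check. The sketch below uses a permutation test on the difference of group means; this is a technique we swap in for illustration and not part of the analysis described above, and the scores are made up rather than taken from the 500 tagged Tweets.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_iter=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_iter):
        # Reassign the group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_iter + 1)


# Made-up scores, NOT data from the study.
pro = [0.10, -0.20, 0.05, 0.30, -0.10, 0.00]
contra = [0.05, -0.15, 0.10, 0.25, -0.05, -0.10]
print(permutation_p_value(pro, contra))
```

A large p-value means the two groups cannot be told apart by their mean score, which is the situation the plots suggest for all three measurements.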
Morteza Shahrezaye
[i] http://wordnet.princeton.edu/
[ii] http://dare.uva.nl/document/154122
What went wrong?
The data set contains Tweets such as:
"US Fracking Boom Creating Crisis of Illegal Toxic Dumping http://t.co/J4vXYSBwrc"
"Fracking Foes Cringe as Unions Back Drilling Boom http://t.co/htrynwJMTG"
"#HumanRightsHere's What #Fracking Can Do to Your #Health"
Whether the language is more negative or more positive is therefore no indicator of the speaker's stance on fracking. In addition, many Tweets contain no adjectives at all...