#kpop Analysis PART 1: How often do TWICE members succeed each other when singing their Korean songs? #DataViz.
Get the data on Kaggle here.
This is Part 1 of a 2-part data analysis post. You may view Part 2 here.
UPDATE (Sep 2023): I updated the visualization to include recent TWICE songs: Alcohol Free (2021), Scientist (2021), Talk That Talk (2022) and Set Me Free (2023). I also added the special release single The Best Thing I Ever Did (2018), which is also considered a title track.
As some of you may know, I am a big fan of the South Korean girl group TWICE. I listen to them when I need an energy boost after being drained when coding and they never fail to motivate me. When I created this blog, I promised myself that I’d do a data project based on their songs, and finally I’ve got some results to share!
Below is the completed visualization for the first part of this data science project. This is a chord diagram of TWICE’s line successions in their Korean singles. As data-to-viz.com defines it, a chord diagram shows connections between several objects represented by sectors on an outer circle. Arcs are drawn from one sector to another if the objects are related to each other, and the size of the arc is proportional to the importance of the relation.
I invite you to hover over the arcs and sectors and check out what’s written on them!
Looks neat, doesn’t it? Let’s break it down below to see how it’s made.
What is line succession?
So what data could we gather from their songs? The most obvious would be to measure the amount of time each member spends singing her lines in each song. This is often visualized as a bar or pie chart of total singing duration per member.
Fellow fans can get really creative when visualizing this: you can see their work by searching for “line distribution” videos on YouTube.
Heated debates often happen when fans see these visualizations. They often go along the lines of: “Why is it that member X always has more lines than the others? Member Y deserves more lines!”
I think this line of thinking is overly simplistic and limiting because:
- The total singing time of each member, on its own, is not a sufficient basis to quantify a member’s “relevance” in a group. For example, members have group roles (e.g. main/lead dancer) which may not translate to more singing lines.
- At this level, nothing is revealed about the group’s creative dynamics that manifest in each song. What determines the order in which the members sing? How does the group spice up their act by introducing variation and contrast? And why do their songs have a unique TWICE flavor that distinguishes them from other k-pop girl group songs?
Like most art forms, a song is greater than the sum of its parts, and I want to capture that complexity. So instead of line distribution, I thought I should analyze TWICE’s line succession: how song lines are transferred from one member to another.
We therefore count succession pairs rather than individual lines, where a succession pair is any two members whose lines come one after the other in a song.
To illustrate, in this section of TWICE’s song Cry for Me (2020) from our dataset:
start_time | end_time | line | vocal1 |
---|---|---|---|
00:00:25,770 | 00:00:39,823 | You don’t know me L O V E or hatred, ibyeol daeshin nan sunjinhan misoman oneuldo ne pume angillae | Mina |
00:00:39,823 | 00:00:41,458 | amugeotto moreuneun cheok | Dahyun |
00:00:41,458 | 00:00:43,140 | Baby no more real love | Momo |
00:00:43,140 | 00:00:44,759 | neoye gyeote isseojulge | Dahyun |
00:00:44,759 | 00:00:46,702 | majimaken break your heart | Momo |
00:00:46,702 | 00:00:53,764 | Bad boy bad boy, yeah you really make me, a mad girl mad girl, wo oh oh | Jeongyeon |
the succession pairs are (Mina, Dahyun), (Dahyun, Momo), (Momo, Dahyun), (Dahyun, Momo) and (Momo, Jeongyeon), since their lines are adjacent to each other.
For this post, we count succession pairs irrespective of order, i.e., (Dahyun, Momo) and (Momo, Dahyun) are considered the same succession pair. Thus, since there are 9 TWICE members, we expect ${}_9\mathrm{C}_2 = 36$ possible succession pairs in all.
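As a minimal sketch of this counting rule (not the exact code behind the chart), the snippet below extracts the pairs from the Cry for Me excerpt and then ignores their order. The function name `succession_pairs` is made up for illustration, and skipping consecutive lines by the same member is an assumption based on the "any two members" definition.

```python
from collections import Counter
from math import comb

# Singers of the Cry for Me (2020) excerpt, in the order their lines occur.
vocals = ["Mina", "Dahyun", "Momo", "Dahyun", "Momo", "Jeongyeon"]

def succession_pairs(singers):
    """Return (previous, next) pairs for adjacent lines sung by different members."""
    # Assumption: back-to-back lines by the same member are not a succession.
    return [(a, b) for a, b in zip(singers, singers[1:]) if a != b]

ordered = succession_pairs(vocals)

# Ignore order: (Dahyun, Momo) and (Momo, Dahyun) become the same frozenset.
unordered = Counter(frozenset(pair) for pair in ordered)

print(ordered)
print(unordered[frozenset({"Dahyun", "Momo"})])  # 3
print(comb(9, 2))  # 36 possible unordered pairs among 9 members
```

Using `frozenset` as the key is just one convenient way to make a pair order-independent; sorting the two names into a tuple would work as well.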
Using the line successions we detect from the dataset, we could ask:
- When member 1 sings, who is most often expected to follow next?
- Which succession pairs occur most often and thus form the backbone of a TWICE song?
- Which succession pairs occur least often and may thus be further explored in TWICE’s upcoming songs?
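The first question needs the order of each handoff, so here is a rough sketch that tallies who follows whom. It reuses the Cry for Me excerpt as toy input; `follower_counts` is a hypothetical helper name, not something from the dataset or the original analysis.

```python
from collections import Counter, defaultdict

# Toy input: singers of the Cry for Me (2020) excerpt, in singing order.
vocals = ["Mina", "Dahyun", "Momo", "Dahyun", "Momo", "Jeongyeon"]

def follower_counts(singers):
    """Map each member to a Counter of the members who sing right after her."""
    followers = defaultdict(Counter)
    for current, nxt in zip(singers, singers[1:]):
        if current != nxt:  # assumption: same-member repeats are not handoffs
            followers[current][nxt] += 1
    return followers

followers = follower_counts(vocals)
print(followers["Dahyun"].most_common(1))  # [('Momo', 2)]
```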
What can we say about the chart?*
* Discussion valid as of 2020. Counts for more recent years increased roughly in proportion to the values below, so the insights are still mostly valid.
- The most frequently occurring succession pairs from the dataset are:
Member1 | Member2 | Count |
---|---|---|
Dahyun | Chaeyoung | 41 |
Jihyo | Nayeon | 39 |
Sana | Tzuyu | 25 |
Jihyo | Jeongyeon | 25 |
Jihyo | Dahyun | 23 |
Taking the top spot is (Dahyun, Chaeyoung), TWICE’s iconic rap duo.
Next, with nearly the same count, is (Jihyo, Nayeon), the group’s main vocalists and also the two members who have the most lines.
The next three are more than a third less frequent than the top 2 succession pairs:
- the subvocalists (Sana, Tzuyu), mostly during bridge and verse lines;
- main vocals (Jihyo, Jeongyeon), who transition during the prechorus or last chorus;
- and (Jihyo, Dahyun), which represents a chorus-rap or rap-chorus transition.
- On the other hand, the least frequently occurring succession pairs from the dataset are:
Member1 | Member2 | Count |
---|---|---|
Momo | Chaeyoung | 5 |
Jihyo | Chaeyoung | 5 |
Tzuyu | Jeongyeon | 5 |
Dahyun | Tzuyu | 5 |
Nayeon | Jeongyeon | 5 |
Sana | Jeongyeon | 2 |
This is interesting to me since most of these pairs occurred in more recent years, which could indicate that the group is moving to diversify its song structure.
Interestingly, both occurrences of the rarest succession pair (Sana, Jeongyeon) happened in one song, Heart Shaker (2017). This is not to say that the two haven’t transitioned lines with each other elsewhere (they may have in their Japanese singles or album tracks), but that is the case for their Korean singles.
- What if we count the total number of succession pairs per song? And the unique succession pairs per song?
According to this plot, Heart Shaker (2017) had the most total and unique successions with 69 and 27 pairs respectively, and TWICE listeners would verify that this song had a lot of transitions, especially in the verses.
In contrast, we have Feel Special (2019) with the fewest total successions at 11 pairs (10 of which are unique) because in this song, each member sings entire verses on her own.
An interesting case is I Can’t Stop Me (2020), with about half as many total succession pairs (34) as Heart Shaker (2017), 23 of which are unique. If we consider only the songs that have at least 30 succession pairs, then I Can’t Stop Me can be considered the most diverse TWICE Korean single so far.
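Here is a small sketch of how total and unique succession pairs per song could be tallied. The rows below are made-up toy data (not the real dataset), `pair_stats` is a hypothetical helper, and plain Python is used instead of pandas only to keep the example self-contained.

```python
from collections import defaultdict

# Hypothetical toy rows of (song, vocal1) in singing order; NOT the real dataset.
rows = [
    ("Song A", "Nayeon"), ("Song A", "Jihyo"), ("Song A", "Nayeon"), ("Song A", "Sana"),
    ("Song B", "Mina"), ("Song B", "Momo"),
]

def pair_stats(rows):
    """Return {song: {"total": ..., "unique": ...}} for unordered succession pairs."""
    singers_by_song = defaultdict(list)
    for song, vocal in rows:
        singers_by_song[song].append(vocal)

    stats = {}
    for song, singers in singers_by_song.items():
        # Pairs never cross song boundaries because each song is handled separately.
        pairs = [frozenset((a, b)) for a, b in zip(singers, singers[1:]) if a != b]
        stats[song] = {"total": len(pairs), "unique": len(set(pairs))}
    return stats

stats = pair_stats(rows)
print(stats["Song A"])  # {'total': 3, 'unique': 2}
print(stats["Song B"])  # {'total': 1, 'unique': 1}
```

One possible way to read "most diverse" is the share of unique pairs: with the figures quoted above, that is 23/34 ≈ 0.68 for I Can’t Stop Me versus 27/69 ≈ 0.39 for Heart Shaker.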
Where did the data come from?
Since TWICE is pretty popular, I expected to find a table somewhere on the web that contains this kind of data, but unfortunately I haven’t found any, so I decided to make my own.
This would normally be a transcription exercise that would take up a lot of time and energy, but I think I found a way to lazily do it with the help of some programming.
- First, I obtained timing data from the subtitles (.srt) of TWICE music videos on YouTube. This is simple: we just need to extract the .srt subtitle file from the video (there are lots of online applets for this!) and read the file in Python to get the start and end timestamps of each subtitle line.
- Next, I got the song lyrics by scraping blog entries from colorcodedlyrics.com. Because the lines are already color-tagged, we can map a member to her part quite easily.
Whenever a line assignment seemed erroneous on the color coded lyrics website, I went back to the music video and/or performance videos to correct it if necessary. (I also admit that in the 3 years I’ve been listening to the group, I’ve already developed an ear to detect each of their voices hehe)
- Here’s the hard part: since lyrics are split differently in the subtitles and in the scraped color coded lyrics, I had to merge the lyric and timing tables manually! To further complicate things, I also had to match the English subtitles with what seems to be the corresponding line in the romanized color coded lyrics, so this was also, somehow, an exercise in linguistic skill for me.
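The subtitle-reading step can be sketched roughly as follows. This assumes the standard .srt layout (cue number, a `start --> end` timing line, then the text); the sample text and the names `parse_srt`, `TIMING` and `to_timedelta` are placeholders for illustration, not the script actually used for the dataset.

```python
import re
from datetime import timedelta

# Made-up two-cue sample in standard .srt layout; the text lines are placeholders.
SRT_TEXT = """1
00:00:25,770 --> 00:00:39,823
first subtitle line (placeholder)

2
00:00:39,823 --> 00:00:41,458
second subtitle line (placeholder)
"""

# Timing line format: HH:MM:SS,mmm --> HH:MM:SS,mmm
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def to_timedelta(h, m, s, ms):
    """Convert string timestamp fields into a timedelta."""
    return timedelta(hours=int(h), minutes=int(m), seconds=int(s), milliseconds=int(ms))

def parse_srt(text):
    """Return a list of dicts with start, end and line for each subtitle cue."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue
        match = TIMING.match(lines[1].strip())
        if not match:
            continue  # skip blocks without a valid timing line
        parts = match.groups()
        cues.append({
            "start": to_timedelta(*parts[:4]),
            "end": to_timedelta(*parts[4:]),
            "line": " ".join(lines[2:]),  # join multi-line subtitle text
        })
    return cues

cues = parse_srt(SRT_TEXT)
for cue in cues:
    print(cue["start"], cue["end"], cue["line"])
```

The resulting rows have the same start_time, end_time and line columns as the table shown earlier; the member column still has to come from the color coded lyrics.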
It took me 12 hours spread over 3 days to complete the steps above. It’s quite tedious, so I had to limit the dataset to Korean singles only for now, although I would be happy to include their Japanese singles and album tracks! If you want to help out, reach me through the links at the bottom of this page.
You may download the dataset from here. Please message me if you find any inconsistencies in the data.
What are the tools I used?
I used Python’s BeautifulSoup for scraping the color coded lyrics, downsub.com to get the .srt files from YouTube, Python’s pandas to read and clean the datasets, Python’s seaborn to make the bar chart, and the JavaScript library d3 to make the final interactive chord diagram chart, which is provided by Nadieh Bremer in her tutorial here.
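A d3 chord layout takes a square matrix as input, so the pair counts have to be reshaped before charting. Below is a rough sketch of that reshaping, assuming a symmetric matrix (since order is ignored) and using only the five top counts quoted in this post; all other cells are left at 0 here, and `to_matrix` is a hypothetical helper, not code from the tutorial.

```python
import json

MEMBERS = ["Nayeon", "Jeongyeon", "Momo", "Sana", "Jihyo",
           "Mina", "Dahyun", "Chaeyoung", "Tzuyu"]

# Only the five most frequent pairs quoted in this post; the rest are omitted.
pair_counts = {
    ("Dahyun", "Chaeyoung"): 41,
    ("Jihyo", "Nayeon"): 39,
    ("Sana", "Tzuyu"): 25,
    ("Jihyo", "Jeongyeon"): 25,
    ("Jihyo", "Dahyun"): 23,
}

def to_matrix(pair_counts, members):
    """Build a symmetric square matrix of pair counts, one row/column per member."""
    index = {name: i for i, name in enumerate(members)}
    matrix = [[0] * len(members) for _ in members]
    for (a, b), count in pair_counts.items():
        # Fill both cells because the pairs are counted irrespective of order.
        matrix[index[a]][index[b]] = count
        matrix[index[b]][index[a]] = count
    return matrix

matrix = to_matrix(pair_counts, MEMBERS)
print(json.dumps(matrix))  # JSON that a JavaScript chart could load
```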
Thanks everyone and please watch out for part 2 of this analysis! –JC