#kpop Analysis PART 1: How often do TWICE members succeed each other when singing their Korean songs? #DataViz.

Get the data on Kaggle here

This is Part 1 of a 2-part data analysis post - you may view Part 2 here



UPDATE (Sep 2023): I updated the visualization to include recent TWICE songs: Alcohol Free (2021), Scientist (2021), Talk That Talk (2022) and Set Me Free (2023). I also added the special release single The Best Thing I Ever Did (2018), which is also considered a title track.




As some of you may know, I am a big fan of the South Korean girl group TWICE. I listen to them when I need an energy boost after coding has drained me, and they never fail to motivate me. When I created this blog, I promised myself that I’d do a data project based on their songs, and finally I’ve got some results to share!



This is the k-pop girl group TWICE. Members from left to right are Sana, Tzuyu, Nayeon, Momo, Mina, Chaeyoung, Dahyun, Jihyo and Jeongyeon



Below is the completed visualization for the first part of this data science project. This is a chord diagram of TWICE’s line successions in their Korean singles. As data-to-viz.com defines it, a chord diagram shows connections between several objects represented by sectors on an outer circle. Arcs are drawn from one sector to another if the objects are related to each other, and the size of the arc is proportional to the importance of the relation.

I invite you to hover over the arcs and sectors and check out what’s written on them!


Looks neat, doesn’t it? Let’s break it down below to see how it’s made.

What is line succession?

So what data could we gather from their songs? The most obvious would be to measure the amount of time each member spends when singing their lines in each of their songs. This is often visualized by a bar/pie chart of total singing duration per member.

A line distribution chart of TWICE songs from 2015-2019 by Mikayla Berry on Koreaboo. You may view the full post here


Fellow fans can get really creative when visualizing this. You can see their work by searching for “line distribution” videos on YouTube.

Heated debates often happen when fans see these visualizations. They often go along the lines of: “Why is it that member X always has more lines than the others? Member Y deserves more lines!”

I think this line of thinking is overly simplistic and limiting because:

  • The total singing time of each member, on its own, is not a sufficient basis to quantify a member’s “relevance” in a group. For example, members have group roles (e.g. main/lead dancer) which may not translate to more singing lines.

  • At this level, nothing is revealed about the group’s creative dynamics that manifest in each song. What determines the order in which the members sing? How does the group spice up their act by introducing variation and contrast? And why do their songs have a unique TWICE flavor that distinguishes them from other k-pop girl group songs?


Like most art forms, a song is greater than the sum of its parts, and I want to capture that complexity. So instead of line distribution, I thought I should analyze TWICE's line succession: how song lines are transferred from one member to another.



So instead of individual lines, we now count succession pairs, defined as any two members whose lines come after one another in one song.

To illustrate, using our dataset over this section of TWICE’s song Cry for Me (2020):

| start_time | end_time | line | vocal1 |
|---|---|---|---|
| 00:00:25,770 | 00:00:39,823 | You don’t know me L O V E or hatred, ibyeol daeshin nan sunjinhan misoman oneuldo ne pume angillae | Mina |
| 00:00:39,823 | 00:00:41,458 | amugeotto moreuneun cheok | Dahyun |
| 00:00:41,458 | 00:00:43,140 | Baby no more real love | Momo |
| 00:00:43,140 | 00:00:44,759 | neoye gyeote isseojulge | Dahyun |
| 00:00:44,759 | 00:00:46,702 | majimaken break your heart | Momo |
| 00:00:46,702 | 00:00:53,764 | Bad boy bad boy, yeah you really make me, a mad girl mad girl, wo oh oh | Jeongyeon |

the succession pairs are (Mina, Dahyun), (Dahyun, Momo), (Momo, Dahyun), (Dahyun, Momo) and (Momo, Jeongyeon), since their lines occur adjacent to each other.

For this post, we count succession pairs irrespective of order, i.e., (Dahyun, Momo) and (Momo, Dahyun) are considered the same succession pair. Thus, since there are 9 TWICE members, we expect ${}_9\mathrm{C}_2 = 36$ possible succession pairs in all.
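As a minimal sketch of this counting in Python, here is the `vocal1` column of the Cry for Me excerpt turned into succession pairs. The function names are hypothetical and only for illustration; they are not taken from the original analysis code.

```python
from collections import Counter
from math import comb

# vocal1 values from the Cry for Me (2020) excerpt, in singing order
vocals = ["Mina", "Dahyun", "Momo", "Dahyun", "Momo", "Jeongyeon"]


def ordered_pairs(singers):
    """Pair each singer with the one whose line comes right after."""
    return list(zip(singers, singers[1:]))


def unordered_counts(singers):
    """Count succession pairs irrespective of order."""
    # frozenset makes (Dahyun, Momo) and (Momo, Dahyun) the same key.
    # Assumption: two consecutive lines by the same member are skipped.
    return Counter(
        frozenset(pair) for pair in ordered_pairs(singers) if pair[0] != pair[1]
    )


pairs = ordered_pairs(vocals)
counts = unordered_counts(vocals)
possible_pairs = comb(9, 2)  # 9 members, choose 2
```

For this excerpt, the five ordered pairs collapse into three distinct unordered pairs, with (Dahyun, Momo) counted three times.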

Using the line successions we detect from the dataset, we could ask:

  • When member 1 sings, who is most often expected to follow next?

  • Which succession pairs occur most often and thus form the backbone of a TWICE song?

  • Which succession pairs occur least often and may thus be further explored in TWICE’s upcoming songs?
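These questions map onto simple tallies. The sketch below is one possible way to set them up, using only the Cry for Me excerpt as input. Note that the first question needs the ordered pairs (who sings, then who follows), while the ranking of most and least frequent pairs uses the unordered ones. The helper name `follower_counts` is hypothetical.

```python
from collections import Counter, defaultdict

# Ordered succession pairs (current singer, next singer) from the excerpt
pairs = [
    ("Mina", "Dahyun"),
    ("Dahyun", "Momo"),
    ("Momo", "Dahyun"),
    ("Dahyun", "Momo"),
    ("Momo", "Jeongyeon"),
]


def follower_counts(ordered_pairs):
    """Map each member to a Counter of who sings right after her."""
    followers = defaultdict(Counter)
    for current, nxt in ordered_pairs:
        followers[current][nxt] += 1
    return followers


followers = follower_counts(pairs)
# Question 1: who most often follows Dahyun in this excerpt?
top_after_dahyun = followers["Dahyun"].most_common(1)[0]

# Questions 2 and 3: rank unordered pairs from most to least frequent
ranked = Counter(frozenset(pair) for pair in pairs).most_common()
most_frequent_pair, most_frequent_count = ranked[0]
```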

What can we say about the chart?*

* Discussion valid as of 2020. For more recent years, the counts increased roughly in proportion to the values below, so the insights are still mostly valid.
  1. The most frequently occurring succession pairs from the dataset are:

| Member1 | Member2 | Count |
|---|---|---|
| Dahyun | Chaeyoung | 41 |
| Jihyo | Nayeon | 39 |
| Sana | Tzuyu | 25 |
| Jihyo | Jeongyeon | 25 |
| Jihyo | Dahyun | 23 |

Coming in at the top spot is (Dahyun, Chaeyoung), TWICE’s iconic rap duo.

Next, and of similar magnitude, is (Jihyo, Nayeon), the group’s main vocalists and also the two members who have the most lines.

The next three are more than a third less frequent than the top 2 succession pairs:

  • the subvocalists (Sana, Tzuyu), mostly during bridge and verse lines;
  • the main vocals (Jihyo, Jeongyeon), who transition during the prechorus or last chorus;
  • and (Jihyo, Dahyun), which represents a chorus-rap or rap-chorus transition.
  2. On the other hand, the least frequently occurring succession pairs from the dataset are:

| Member1 | Member2 | Count |
|---|---|---|
| Momo | Chaeyoung | 5 |
| Jihyo | Chaeyoung | 5 |
| Tzuyu | Jeongyeon | 5 |
| Dahyun | Tzuyu | 5 |
| Nayeon | Jeongyeon | 5 |
| Sana | Jeongyeon | 2 |

This is interesting to me since most of these pairs occurred in more recent years, which could indicate that the group is moving to diversify its song structure.

Interestingly, both occurrences of the rarest succession pair (Sana, Jeongyeon) happened in one song, Heart Shaker (2017). This is not to say that the two never hand lines over to each other elsewhere (they may have in their Japanese singles or album tracks), but within their Korean singles this is the only song where it happens.

  3. What if we count the total number of succession pairs per song? And the unique succession pairs per song?

Total line succession pairs per song

According to this plot, Heart Shaker (2017) had the most total and unique successions with 69 and 27 pairs respectively, and TWICE listeners would verify that this song had a lot of transitions, especially in the verses.

In contrast, Feel Special (2019) has the fewest total successions at 11 pairs (10 of which are unique), because in this song each member sings entire verses on her own.

An interesting case is I Can’t Stop Me (2020), which has about half as many total succession pairs (34) as Heart Shaker (2017), yet 23 of them are unique. If we consider only the songs that have at least 30 succession pairs, I Can’t Stop Me can be considered the most diverse TWICE Korean single so far.
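To make the comparison concrete, here is a small sketch of how totals and unique counts can be derived from a song's singer sequence, followed by the share of unique pairs for the three songs quoted above. Using the unique share as the measure of "diversity" is my assumption for this sketch, and only the three songs whose numbers appear in the text are included.

```python
def total_and_unique(singers):
    """Return (total successions, distinct unordered pairs) for one song."""
    pairs = [frozenset(pair) for pair in zip(singers, singers[1:])]
    return len(pairs), len(set(pairs))


# Sanity check on the Cry for Me excerpt shown earlier
excerpt = ["Mina", "Dahyun", "Momo", "Dahyun", "Momo", "Jeongyeon"]
excerpt_total, excerpt_unique = total_and_unique(excerpt)

# (total, unique) succession pairs per song, as quoted in the text
song_counts = {
    "Heart Shaker (2017)": (69, 27),
    "Feel Special (2019)": (11, 10),
    "I Can't Stop Me (2020)": (34, 23),
}

MIN_TOTAL = 30  # threshold used in the text

# Share of unique pairs, only for songs with at least MIN_TOTAL successions
shares = {
    song: round(unique / total, 2)
    for song, (total, unique) in song_counts.items()
    if total >= MIN_TOTAL
}
most_diverse = max(shares, key=shares.get)
```

With these numbers, about 68% of the successions in I Can't Stop Me are unique pairs, against about 39% for Heart Shaker.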

Where did the data come from?

Since TWICE is pretty popular, I expected to find a table somewhere on the web that contains this kind of data. Unfortunately, I haven’t found any, so I decided to make my own.

This would normally be a transcription exercise that takes up a lot of time and energy, but I think I found a way to do it lazily with the help of some programming.

  1. First, I obtained timing data from the subtitles (.srt) of TWICE music videos on YouTube. This is simple: we just need to extract the .srt subtitle file from the video (there are lots of online applets for this!) and read the file in Python to get the start and end timestamps of each subtitle line.

  2. Next, I got the song lyrics by scraping blog entries from colorcodedlyrics.com. Because the lines are already color-tagged, we can map each member to her part quite easily.

Whenever a line assignment on the color coded lyrics website seems erroneous, I go back to the music video and/or performance videos to check and correct it if necessary. (I also admit that in the 3 years I’ve been listening to the group, I’ve already developed an ear to detect each of their voices hehe)

  3. Here’s the hard part: since lyrics are split differently in the subtitles and in the scraped color coded lyrics, I had to merge the lyric and timing tables manually! To further complicate things, I also had to match the English subtitles with what seems to be the corresponding line in the romanized color coded lyrics, so this was also, in a way, an exercise in linguistic skill for me.
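For the timing step, reading an .srt file takes only a few lines of Python. The sketch below assumes the standard SubRip layout (a cue number, a `start --> end` timing line, then the subtitle text, with cues separated by blank lines). The timestamps come from the Cry for Me excerpt, but the cue text is only a placeholder, and `parse_srt` is a hypothetical helper rather than the code I actually ran.

```python
import re
from datetime import timedelta

# Sample in SubRip layout; timestamps are from the Cry for Me excerpt,
# the cue text is a placeholder (not the real subtitle).
SAMPLE_SRT = """1
00:00:39,823 --> 00:00:41,458
(subtitle text for cue 1)

2
00:00:41,458 --> 00:00:43,140
(subtitle text for cue 2)
"""

TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def to_timedelta(hours, minutes, seconds, millis):
    return timedelta(
        hours=int(hours),
        minutes=int(minutes),
        seconds=int(seconds),
        milliseconds=int(millis),
    )


def parse_srt(text):
    """Return a list of (start, end, text) tuples, one per subtitle cue."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                parts = match.groups()
                start = to_timedelta(*parts[:4])
                end = to_timedelta(*parts[4:])
                # Everything after the timing line is the subtitle text
                cues.append((start, end, " ".join(lines[i + 1:]).strip()))
                break
    return cues


cues = parse_srt(SAMPLE_SRT)
first_duration = cues[0][1] - cues[0][0]
```

The resulting start and end times can then be placed next to the lyric lines, which is the part that still had to be matched by hand.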

It took me 12 hours spread over 3 days to complete the steps above. It’s quite tedious, so I had to limit the dataset to Korean singles only for now, although I would be happy to include their Japanese singles and album tracks! If you want to help out, reach me through the links at the bottom of this page.

You may download the dataset from here. Please message me if you find any inconsistencies in the data.

What are the tools I used?

I used Python’s BeautifulSoup to scrape the color coded lyrics, downsub.com to get the .srt files from YouTube, pandas to read and clean the datasets, and seaborn to make the bar chart. The final interactive chord diagram is made with the JavaScript library d3, based on the code provided by Nadieh Bremer in her tutorial here.
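d3's chord layout takes a square matrix as input, with one row and column per group. As a rough sketch of the hand-off from Python to d3, the unordered pair counts can be laid out as a symmetric member-by-member matrix and dumped to JSON. Placing each count in both directions is my assumption here, not necessarily how the published chart was built, and only the five top pairs quoted above are filled in.

```python
import json

MEMBERS = ["Nayeon", "Jeongyeon", "Momo", "Sana", "Jihyo",
           "Mina", "Dahyun", "Chaeyoung", "Tzuyu"]

# The five most frequent pairs quoted earlier in the post (2020 counts).
# All other pairs are left out here, so their cells stay at 0.
pair_counts = {
    ("Dahyun", "Chaeyoung"): 41,
    ("Jihyo", "Nayeon"): 39,
    ("Sana", "Tzuyu"): 25,
    ("Jihyo", "Jeongyeon"): 25,
    ("Jihyo", "Dahyun"): 23,
}


def chord_matrix(members, counts):
    """Build a symmetric matrix of pair counts, one row/column per member."""
    index = {name: i for i, name in enumerate(members)}
    size = len(members)
    matrix = [[0] * size for _ in range(size)]
    for (a, b), count in counts.items():
        # Assumption: an unordered pair contributes equally in both directions
        matrix[index[a]][index[b]] = count
        matrix[index[b]][index[a]] = count
    return matrix


matrix = chord_matrix(MEMBERS, pair_counts)
matrix_json = json.dumps(matrix)  # can be loaded on the JavaScript side
```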





Thanks everyone and please watch out for part 2 of this analysis! –JC