Objective: Scene graph generation is a critical task in computer vision that enables a comprehensive, in-depth understanding of visual scenes. It focuses on identifying entities and the relationships between them, and requires the model to output a set of subject-predicate-object triplets together with a graph-structured scene representation. This places high demands on the model's understanding and reasoning capabilities. Although existing scene graph generation methods have achieved substantial success, most models are hindered by either an excessive number of parameters or inaccurate predicate prediction. This study proposes an end-to-end rough-and-refine model (RRM) for scene graph generation to overcome these challenges.

Methods: The proposed RRM consisted of two components, the rough part and the refine part, which were responsible for predicting and updating entities and their relationships, respectively. In the rough part, image features were first extracted using convolutional neural networks and a Transformer encoder. These features were then fed, together with entity queries, into the entity decoder for self-attention computation, yielding preliminary entity representations. A predicate decoder was placed after the entity decoder to generate predicate predictions. Because predicting the relationship between entities requires both entity information and image feature information, the predicate decoder took image features, entity representations, and predicate queries as inputs. Specifically, the entity representations were integrated into the predicate queries, and further attention computations in the predicate decoder then produced the predicate representations. Through the rough part, the model gained a preliminary perception of the scene and a preliminary predictive capability.
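As an illustration of the data flow in the rough part, the following minimal Python sketch uses single-head scaled dot-product attention without learned projections. The function names (`attend`, `residual`, `rough_part`), the residual additions, and the choice to fuse entity representations into the predicate queries through an attention step are assumptions made for illustration only; the description above does not specify these details, and the actual decoders are multi-layer Transformer modules.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def residual(x, update):
    """Element-wise sum of two equally shaped matrices (lists of rows)."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, update)]


def rough_part(image_feats, entity_queries, predicate_queries):
    # Entity decoder: entity queries gather evidence from the image features.
    entity_repr = residual(
        entity_queries, attend(entity_queries, image_feats, image_feats))
    # Assumed fusion step: predicate queries absorb entity information.
    fused = residual(
        predicate_queries, attend(predicate_queries, entity_repr, entity_repr))
    # Predicate decoder: the fused queries attend to the image features.
    predicate_repr = residual(fused, attend(fused, image_feats, image_feats))
    return entity_repr, predicate_repr


rng = random.Random(0)


def random_rows(n_rows, dim):
    """Hypothetical stand-in for learned queries and encoded image features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_rows)]


image_feats = random_rows(6, 4)        # 6 image tokens, feature size 4
entity_queries = random_rows(3, 4)     # 3 entity queries
predicate_queries = random_rows(2, 4)  # 2 predicate queries
entity_repr, predicate_repr = rough_part(
    image_feats, entity_queries, predicate_queries)
print(len(entity_repr), len(predicate_repr))
```

The sketch keeps one output row per query, which mirrors the query-based design: the number of entity and predicate candidates is fixed by the number of queries, not by the image content.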
However, because entities did not exchange information with one another, this stage struggled to capture deeper semantic information. Ambiguity also remained in distinguishing subjects from objects, which motivated the design of the refine part. In the refine part, a triplet query generation module was first established to support the subsequent triplet prediction. Since a triplet requires three distinct types of information (subject, object, and predicate), three paths were designed: the subject path, the object path, and the predicate path. In each path, the model took the image features, entity representations, and relationship representations derived from the rough part and integrated them through cross-attention. In addition, the subject and object predictions were fused with the predicate information to enhance the representational capacity of the predicate path. This design allowed the model to consider the states of entity pairs more thoroughly during relationship prediction, fostering a deeper understanding of the interactions between subjects and objects. Once the triplet representations were obtained, the model produced the final predictions. For the subject and the object, it predicted the category and the location, given as a bounding box with normalized center coordinates (x, y) and normalized box dimensions (width and height). For the predicate, only the category was predicted. Predictions for the three paths were made independently by feedforward neural networks, each consisting of two perceptron layers with ReLU activation functions followed by a linear projection layer, which handled both category classification and bounding box regression.
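The prediction heads and the box format described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the weights are random rather than trained, the helper names (`make_layer`, `ffn_head`, `to_pixel_corners`) are hypothetical, and the sigmoid that keeps box outputs in the range (0, 1) is a common choice in DETR-style detectors that the description above does not confirm.

```python
import math
import random


def relu(vec):
    return [max(0.0, x) for x in vec]


def linear(vec, weight, bias):
    """Dense layer: one output per row of `weight`."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def make_layer(rng, n_in, n_out):
    """Hypothetical random initialisation standing in for trained weights."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
              for _ in range(n_out)]
    return weight, [0.0] * n_out


def ffn_head(vec, layers):
    """Two ReLU layers followed by a linear projection."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    hidden = relu(linear(vec, w1, b1))
    hidden = relu(linear(hidden, w2, b2))
    return linear(hidden, w3, b3)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def to_pixel_corners(box, img_w, img_h):
    """Convert normalized (cx, cy, w, h) to pixel (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)


rng = random.Random(0)
dim = 8
# Box head for the subject (or object) path: 4 outputs for (cx, cy, w, h).
box_layers = [make_layer(rng, dim, dim), make_layer(rng, dim, dim),
              make_layer(rng, dim, 4)]
subject_repr = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

# Assumed sigmoid squashing keeps every box value in (0, 1).
box = [sigmoid(x) for x in ffn_head(subject_repr, box_layers)]
print(to_pixel_corners(box, 640, 480))
```

A classification head has the same shape with the final layer sized to the number of categories; the predicate path needs only that head, since predicates have no bounding box.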
Results and Discussions: Two metrics commonly used in this research domain, Recall@K (R@K) and Mean Recall@K (mR@K), were employed to evaluate the performance of RRM on the scene graph generation task. R@K reflects the overall recall of the model on the dataset: it measures the fraction of ground-truth triplets that appear among the top-K predicted triplets. In contrast, mR@K computes R@K separately for each predicate category and then averages the results. This metric places greater emphasis on the model's ability to learn low-frequency predicate categories, because infrequent predicates carry the same weight as frequent ones. This is particularly important given the long-tail distribution of predicates in the dataset, as it reflects the model's learning capability across all predicate categories. RRM achieved the best R@K results among single-stage methods on R@20, R@50, and R@100. Specifically, it achieved R@20 = 23.8, R@50 = 29.1, and R@100 = 32.5, exceeding the best values of the other single-stage methods by 2.6, 1.6, and 2.4, respectively. Its mR@K results exceeded those of FCSGG, HOTR, RelTR, and most two-stage methods, reaching mR@20 = 7.7, mR@50 = 11.0, and mR@100 = 12.4. In a method-by-method comparison, RRM significantly outperformed FCSGG and HOTR, and it also outperformed RelTR on all six R@K and mR@K metrics, although RelTR has a smaller parameter count. Compared with SGTR and SGTR+, RRM performed better in terms of R@K, mR@20, and parameter count, whereas SGTR and SGTR+ achieved better results on mR@50 and mR@100. The ablation experiments indicated that each module made a positive contribution to scene graph prediction, with the removal of any single module leading to a decline in the results.
The FMI and EMI modules had a significant impact on the model: removing FMI or EMI resulted in average decreases of 11.7% and 8.9%, respectively, because these modules introduced crucial scene information. The TQG and EPR modules also provided measurable improvements, with average decreases of 4.7% and 5.2% when they were removed. The model in the first row of the table, which excluded all four modules, was equivalent to the rough part alone and showed an average decrease of 24.9%.

Conclusions: A scene graph generation method based on a rough-and-refine network is proposed to address the challenge of inadequate predicate representation. Experimental results demonstrate that the proposed model achieves strong performance on public datasets, surpassing existing models on several key evaluation metrics and enabling accurate extraction of scene graphs from images. Visualization experiments conducted in diverse scenarios confirm the model's capability in scene graph generation and highlight the performance improvement that the refine part provides over the rough part.
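The R@K and mR@K metrics used in the evaluation can be sketched as follows. This is a simplified illustration, not the evaluation code behind the reported numbers: it matches triplets by their (subject, predicate, object) labels only, whereas scene graph benchmarks usually also require the predicted boxes to overlap the ground-truth boxes, and it pools counts over the whole dataset. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triplets found among the top-k predictions.

    predictions: per image, a list of triplets sorted by descending confidence.
    ground_truth: per image, a list of labeled triplets.
    """
    hits = 0
    total = 0
    for preds, gts in zip(predictions, ground_truth):
        top_k = set(preds[:k])
        hits += sum(1 for triplet in gts if triplet in top_k)
        total += len(gts)
    return hits / total if total else 0.0


def mean_recall_at_k(predictions, ground_truth, k):
    """Recall@k computed per predicate category, then averaged."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for preds, gts in zip(predictions, ground_truth):
        top_k = set(preds[:k])
        for triplet in gts:
            predicate = triplet[1]
            totals[predicate] += 1
            if triplet in top_k:
                hits[predicate] += 1
    per_predicate = [hits[p] / totals[p] for p in totals]
    return sum(per_predicate) / len(per_predicate) if per_predicate else 0.0


# Toy data: "on" is frequent, "wearing" is rare.
ground_truth = [
    [("man", "on", "horse"), ("man", "wearing", "hat"), ("horse", "on", "grass")],
    [("dog", "on", "sofa")],
]
predictions = [
    [("man", "on", "horse"), ("horse", "on", "grass"),
     ("man", "holding", "hat"), ("man", "wearing", "hat")],
    [("dog", "on", "sofa"), ("dog", "near", "sofa")],
]

r_at_2 = recall_at_k(predictions, ground_truth, 2)
mr_at_2 = mean_recall_at_k(predictions, ground_truth, 2)
print(r_at_2, mr_at_2)
```

On this toy data, three of the four ground-truth triplets fall within the top two predictions, so R@2 is 0.75. The rare predicate "wearing" is missed entirely while "on" is always recovered, so mR@2 drops to 0.5, which illustrates why mR@K is the more sensitive metric under a long-tail predicate distribution.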